AN INTRODUCTION TO
GENERAL-PURPOSE
GPU PROGRAMMING
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
NVIDIA makes no warranty or representation that the techniques described herein are free from any Intellectual Property claims. The reader assumes all risk of any such claims based on his or her use of these techniques.
The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:
U.S. Corporate and Government Sales
(800) 382-3419
corpsales@pearsontechgroup.com
For sales outside the United States, please contact:
International Sales
international@pearson.com
Visit us on the Web: informit.com/aw
Library of Congress Cataloging-in-Publication Data
Sanders, Jason.
CUDA by example : an introduction to general-purpose GPU programming / Jason Sanders, Edward Kandrot.
p. cm.
Includes index.
ISBN 978-0-13-138768-3 (pbk. : alk. paper)
1. Application software—Development. 2. Computer architecture. 3.
Parallel programming (Computer science) I. Kandrot, Edward. II. Title.
QA76.76.A65S255 2010
005.2'75—dc22
2010017618
Copyright © 2011 NVIDIA Corporation
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:
Pearson Education, Inc.
Rights and Contracts Department
501 Boylston Street, Suite 900
Boston, MA 02116
Fax: (617) 671-3447
ISBN-13: 978-0-13-138768-3
ISBN-10: 0-13-138768-5
Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan.
First printing, July 2010
1.2 The Age of Parallel Processing
1.2.1 Central Processing Units
1.4.1 What Is the CUDA Architecture?
1.4.2 Using the CUDA Architecture
1.5.2 Computational Fluid Dynamics
2.2.1 CUDA-Enabled Graphics Processors
2.2.3 CUDA Development Toolkit
4 PARALLEL PROGRAMMING IN CUDA C
5.2.2 GPU Ripple Using Threads
5.3 Shared Memory and Synchronization
5.3.2 Dot Product Optimized (Incorrectly)
6.2.1 Ray Tracing Introduction
6.2.3 Ray Tracing with Constant Memory
6.2.4 Performance with Constant Memory
6.3 Measuring Performance with Events
6.3.1 Measuring Ray Tracer Performance
7.3.2 Computing Temperature Updates
7.3.3 Animating the Simulation
7.3.5 Using Two-Dimensional Texture Memory
8.3 GPU Ripple with Graphics Interoperability
8.3.1 The GPUAnimBitmap Structure
8.4 Heat Transfer with Graphics Interop
9.2.1 The Compute Capability of NVIDIA GPUs
9.2.2 Compiling for a Minimum Compute Capability
9.3 Atomic Operations Overview
9.4.1 CPU Histogram Computation
9.4.2 GPU Histogram Computation
10.4 Using a Single CUDA Stream
10.5 Using Multiple CUDA Streams
10.7 Using Multiple CUDA Streams Effectively
12.2.4 NVIDIA GPU Computing SDK
12.2.5 NVIDIA Performance Primitives
12.3.1 Programming Massively Parallel Processors: A Hands-On Approach
12.4.1 CUDA Data Parallel Primitives Library
A.1.2 Dot Product Redux: Atomic Locks
Recent activities of major chip manufacturers such as NVIDIA make it more evident than ever that future designs of microprocessors and large HPC systems will be hybrid/heterogeneous in nature. These heterogeneous systems will rely on the integration of two major types of components in varying proportions:
• Multi- and many-core CPU technology: The number of cores will continue to escalate because of the desire to pack more and more components on a chip while avoiding the power wall, the instruction-level parallelism wall, and the memory wall.
• Special-purpose hardware and massively parallel accelerators: For example, GPUs from NVIDIA have outpaced standard CPUs in floating-point performance in recent years. Furthermore, they have arguably become as easy, if not easier, to program than multicore CPUs.
The relative balance between these component types in future designs is not clear and will likely vary over time. There seems to be no doubt that future generations of computer systems, ranging from laptops to supercomputers, will consist of a composition of heterogeneous components. Indeed, the petaflop (10¹⁵ floating-point operations per second) performance barrier was breached by such a system.
And yet the problems and the challenges for developers in the new computational landscape of hybrid processors remain daunting. Critical parts of the software infrastructure are already having a very difficult time keeping up with the pace of change. In some cases, performance cannot scale with the number of cores because an increasingly large portion of time is spent on data movement rather than arithmetic. In other cases, software tuned for performance is delivered years after the hardware arrives and so is obsolete on delivery. And in some cases, as on some recent GPUs, software will not run at all because programming environments have changed too much.
CUDA by Example addresses the heart of the software development challenge by leveraging one of the most innovative and powerful solutions to the problem of programming the massively parallel accelerators in recent years.
This book introduces you to programming in CUDA C by providing examples and insight into the process of constructing and effectively using NVIDIA GPUs. It presents introductory concepts of parallel computing from simple examples to debugging (both logical and performance), as well as covers advanced topics and issues related to using and building many applications. Throughout the book, programming examples reinforce the concepts that have been presented.
The book is required reading for anyone working with accelerator-based computing systems. It explores parallel computing in depth and provides an approach to many problems that may be encountered. It is especially useful for application developers, numerical library writers, and students and teachers of parallel computing.
I have enjoyed and learned from this book, and I feel confident that you will as well.
Jack Dongarra
University Distinguished Professor, University of Tennessee
Distinguished Research Staff Member, Oak Ridge National Laboratory
This book shows how, by harnessing the power of your computer’s graphics processing unit (GPU), you can write high-performance software for a wide range of applications. Although originally designed to render computer graphics on a monitor (and still used for this purpose), GPUs are increasingly being called upon for equally demanding programs in science, engineering, and finance, among other domains. We refer collectively to GPU programs that address problems in nongraphics domains as general-purpose. Happily, although you need to have some experience working in C or C++ to benefit from this book, you need not have any knowledge of computer graphics. None whatsoever! GPU programming simply offers you an opportunity to build—and to build mightily—on your existing programming skills.
To program NVIDIA GPUs to perform general-purpose computing tasks, you will want to know what CUDA is. NVIDIA GPUs are built on what’s known as the CUDA Architecture. You can think of the CUDA Architecture as the scheme by which NVIDIA has built GPUs that can perform both traditional graphics-rendering tasks and general-purpose tasks. To program CUDA GPUs, we will be using a language known as CUDA C. As you will see very early in this book, CUDA C is essentially C with a handful of extensions to allow programming of massively parallel machines like NVIDIA GPUs.
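To give a flavor of those extensions before the book covers them properly, here is a minimal sketch of a complete CUDA C program that adds two short vectors on the GPU. It is illustrative rather than one of the book's own examples: the kernel name and array size are our choices, and error checking is omitted for brevity.

```cuda
#include <stdio.h>

// __global__ marks a function (a "kernel") that runs on the GPU
// but can be called from ordinary host code.
__global__ void add(int *a, int *b, int *c) {
    int i = blockIdx.x;   // each block handles one element
    c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 4;
    int a[N] = {1, 2, 3, 4}, b[N] = {10, 20, 30, 40}, c[N];
    int *dev_a, *dev_b, *dev_c;

    // Allocate GPU memory and copy the inputs to the device.
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // The <<<N,1>>> launch syntax is one of the C extensions:
    // run N copies of add() in parallel, one per block.
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    // Copy the result back and print it.
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```

Everything here is standard C except the `__global__` qualifier, which marks a function that executes on the device, and the `<<<N,1>>>` angle-bracket syntax, which tells the CUDA runtime how many parallel copies of the kernel to launch.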
We’ve geared CUDA by Example toward experienced C or C++ programmers who have enough familiarity with C such that they are comfortable reading and writing code in C. This book builds on your experience with C and intends to serve as an example-driven, “quick-start” guide to using NVIDIA’s CUDA C programming language. By no means do you need to have done large-scale software architecture, to have written a C compiler or an operating system kernel, or to know all the ins and outs of the ANSI C standards. However, we do not spend time reviewing C syntax or common C library routines such as malloc() or memcpy(), so we will assume that you are already reasonably familiar with these topics.
You will encounter some techniques that can be considered general parallel programming paradigms, although this book does not aim to teach general parallel programming techniques. Also, while we will look at nearly every part of the CUDA API, this book does not serve as an extensive API reference nor will it go into gory detail about every tool that you can use to help develop your CUDA C software. Consequently, we highly recommend that this book be used in conjunction with NVIDIA’s freely available documentation, in particular the NVIDIA CUDA Programming Guide and the NVIDIA CUDA Best Practices Guide. But don’t stress out about collecting all these documents because we’ll walk you through everything you need to do.
All of the required NVIDIA software can be found linked from http://developer.nvidia.com/object/gpucomputing.html. Chapter 2 of this book discusses which components are absolutely necessary in order to get started writing CUDA C programs. Because this book aims to teach by example, it contains a great many code samples. This code can be downloaded from http://developer.nvidia.com/object/cuda-by-example.html.
Without further ado, the world of programming NVIDIA GPUs with CUDA C awaits!
It’s been said that it takes a village to write a technical book, and CUDA by Example is no exception to this adage. The authors owe debts of gratitude to many people, some of whom we would like to thank here.
Ian Buck, NVIDIA’s senior director of GPU computing software, has been immeasurably helpful in every stage of the development of this book, from championing the idea to managing many of the details. We also owe Tim Murray, our always-smiling reviewer, much of the credit for this book possessing even a modicum of technical accuracy and readability. Many thanks also go to our designer, Darwin Tat, who created fantastic cover art and figures on an extremely tight schedule. Finally, we are much obliged to John Park, who helped guide this project through the delicate legal process required of published work.
Without help from Addison-Wesley’s staff, this book would still be nothing more than a twinkle in the eyes of the authors. Peter Gordon, Kim Boedigheimer, and Julie Nahil have all shown unbounded patience and professionalism and have genuinely made the publication of this book a painless process. Additionally, Molly Sharp’s production work and Kim Wimpsett’s copyediting have utterly transformed this text from a pile of documents riddled with errors to the volume you’re reading today.
Some of the content of this book could not have been included without the help of other contributors. Specifically, Nadeem Mohammad was instrumental in researching the CUDA case studies we present in Chapter 1, and Nathan Whitehead generously provided code that we incorporated into examples throughout the book.
We would be remiss if we didn’t thank the others who read early drafts of this text and provided helpful feedback, including Genevieve Breed and Kurt Wall. Many of the NVIDIA software engineers provided invaluable technical assistance during the course of developing the content for CUDA by Example, including Mark Hairgrove, who scoured the book, uncovering all manner of inconsistencies: technical, typographical, and grammatical. Steve Hines, Nicholas Wilt, and Stephen Jones consulted on specific sections of the CUDA API, helping elucidate nuances that the authors would have otherwise overlooked. Thanks also go out to Randima Fernando, who helped to get this project off the ground, and to Michael Schidlowsky for acknowledging Jason in his book.
And what acknowledgments section would be complete without a heartfelt expression of gratitude to parents and siblings? It is here that we would like to thank our families, who have been with us through everything and have made this all possible. With that said, we would like to extend special thanks to loving parents, Edward and Kathleen Kandrot and Stephen and Helen Sanders. Thanks also go to our brothers, Kenneth Kandrot and Corey Sanders. Thank you all for your unwavering support.
Jason Sanders is a senior software engineer in the CUDA Platform group at NVIDIA. While at NVIDIA, he helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. Jason received his master’s degree in computer science from the University of California, Berkeley, where he published research in GPU computing, and he holds a bachelor’s degree in electrical engineering from Princeton University. Prior to joining NVIDIA, he held positions at ATI Technologies, Apple, and Novell. When he’s not writing books, Jason is typically working out, playing soccer, or shooting photos.
Edward Kandrot is a senior software engineer on the CUDA Algorithms team at NVIDIA. He has more than 20 years of industry experience focused on optimizing code and improving performance, including for Photoshop and Mozilla. Kandrot has worked for Adobe, Microsoft, and Google, and he has been a consultant at many companies, including Apple and Autodesk. When not coding, he can be found playing World of Warcraft or visiting Las Vegas for the amazing food.
There was a time in the not-so-distant past when parallel computing was looked upon as an “exotic” pursuit and typically got compartmentalized as a specialty within the field of computer science. This perception has changed in profound ways in recent years. The computing world has shifted to the point where, far from being an esoteric pursuit, nearly every aspiring programmer needs training in parallel programming to be fully effective in computer science. Perhaps you’ve picked this book up unconvinced about the importance of parallel programming in the computing world today and the increasingly large role it will play in the years to come. This introductory chapter will examine recent trends in the hardware that does the heavy lifting for the software that we as programmers write. In doing so, we hope to convince you that the parallel computing revolution has already happened and that, by learning CUDA C, you’ll be well positioned to write high-performance applications for heterogeneous platforms that contain both central and graphics processing units.
Through the course of this chapter, you will accomplish the following:
• You will learn about the increasingly important role of parallel computing.
• You will learn a brief history of GPU computing and CUDA.
• You will learn about some successful applications that use CUDA C.
In recent years, much has been made of the computing industry’s widespread shift to parallel computing. Nearly all consumer computers in the year 2010 will ship with multicore central processors. From the introduction of dual-core, low-end netbook machines to 8- and 16-core workstation computers, no longer will parallel computing be relegated to exotic supercomputers or mainframes. Moreover, electronic devices such as mobile phones and portable music players have begun to incorporate parallel computing capabilities in an effort to provide functionality well beyond those of their predecessors.
More and more, software developers will need to cope with a variety of parallel computing platforms and technologies in order to provide novel and rich experiences for an increasingly sophisticated base of users. Command prompts are out; multithreaded graphical interfaces are in. Cellular phones that only make calls are out; phones that can simultaneously play music, browse the Web, and provide GPS services are in.
For 30 years, one of the important methods for improving the performance of consumer computing devices has been to increase the speed at which the processor’s clock operated. Starting with the first personal computers of the early 1980s, consumer central processing units (CPUs) ran with internal clocks operating around 1MHz. About 30 years later, most desktop processors have clock speeds between 1GHz and 4GHz, nearly 1,000 times faster than the clock on the original personal computer. Although increasing the CPU clock speed is certainly not the only method by which computing performance has been improved, it has always been a reliable source for improved performance.
In recent years, however, manufacturers have been forced to look for alternatives to this traditional source of increased computational power. Because of various fundamental limitations in the fabrication of integrated circuits, it is no longer feasible to rely on upward-spiraling processor clock speeds as a means for extracting additional power from existing architectures. Because of power and heat restrictions as well as a rapidly approaching physical limit to transistor size, researchers and manufacturers have begun to look elsewhere.
Outside the world of consumer computing, supercomputers have for decades extracted massive performance gains in similar ways. The performance of a processor used in a supercomputer has climbed astronomically, similar to the improvements in the personal computer CPU. However, in addition to dramatic improvements in the performance of a single processor, supercomputer manufacturers have also extracted massive leaps in performance by steadily increasing the number of processors. It is not uncommon for the fastest supercomputers to have tens or hundreds of thousands of processor cores working in tandem.
In the search for additional processing power for personal computers, the improvement in supercomputers raises a very good question: Rather than solely looking to increase the performance of a single processing core, why not put more than one in a personal computer? In this way, personal computers could continue to improve in performance without the need for continuing increases in processor clock speed.
In 2005, faced with an increasingly competitive marketplace and few alternatives, leading CPU manufacturers began offering processors with two computing cores instead of one. Over the following years, they followed this development with the release of three-, four-, six-, and eight-core central processor units. Sometimes referred to as the multicore revolution, this trend has marked a huge shift in the evolution of the consumer computing market.
Today, it is relatively challenging to purchase a desktop computer with a CPU containing but a single computing core. Even low-end, low-power central processors ship with two or more cores per die. Leading CPU manufacturers have already announced plans for 12- and 16-core CPUs, further confirming that parallel computing has arrived for good.
In comparison to the central processor’s traditional data processing pipeline, performing general-purpose computations on a graphics processing unit (GPU) is a new concept. In fact, the GPU itself is relatively new compared to the computing field at large. However, the idea of computing on graphics processors is not as new as you might believe.
We have already looked at how central processors evolved in both clock speeds and core count. In the meantime, the state of graphics processing underwent a dramatic revolution. In the late 1980s and early 1990s, the growth in popularity of graphically driven operating systems such as Microsoft Windows helped create a market for a new type of processor. In the early 1990s, users began purchasing 2D display accelerators for their personal computers. These display accelerators offered hardware-assisted bitmap operations to assist in the display and usability of graphical operating systems.
Around the same time, in the world of professional computing, a company by the name of Silicon Graphics spent the 1980s popularizing the use of three-dimensional graphics in a variety of markets, including government and defense applications and scientific and technical visualization, as well as providing the tools to create stunning cinematic effects. In 1992, Silicon Graphics opened the programming interface to its hardware by releasing the OpenGL library. Silicon Graphics intended OpenGL to be used as a standardized, platform-independent method for writing 3D graphics applications. As with parallel processing and CPUs, it would only be a matter of time before the technologies found their way into consumer applications.
By the mid-1990s, the demand for consumer applications employing 3D graphics had escalated rapidly, setting the stage for two fairly significant developments. First, the release of immersive, first-person games such as Doom, Duke Nukem 3D, and Quake helped ignite a quest to create progressively more realistic 3D environments for PC gaming. Although 3D graphics would eventually work their way into nearly all computer games, the popularity of the nascent first-person shooter genre would significantly accelerate the adoption of 3D graphics in consumer computing. At the same time, companies such as NVIDIA, ATI Technologies, and 3dfx Interactive began releasing graphics accelerators that were affordable enough to attract widespread attention. These developments cemented 3D graphics as a technology that would figure prominently for years to come.
The release of NVIDIA’s GeForce 256 further pushed the capabilities of consumer graphics hardware. For the first time, transform and lighting computations could be performed directly on the graphics processor, thereby enhancing the potential for even more visually interesting applications. Since transform and lighting were already integral parts of the OpenGL graphics pipeline, the GeForce 256 marked the beginning of a natural progression where increasingly more of the graphics pipeline would be implemented directly on the graphics processor.
From a parallel-computing standpoint, NVIDIA’s release of the GeForce 3 series in 2001 represents arguably the most important breakthrough in GPU technology. The GeForce 3 series was the computing industry’s first chip to implement Microsoft’s then-new DirectX 8.0 standard. This standard required that compliant hardware contain both programmable vertex and programmable pixel shading stages. For the first time, developers had some control over the exact computations that would be performed on their GPUs.
The release of GPUs that possessed programmable pipelines attracted many researchers to the possibility of using graphics hardware for more than simply OpenGL- or DirectX-based rendering. The general approach in the early days of GPU computing was extraordinarily convoluted. Because standard graphics APIs such as OpenGL and DirectX were still the only way to interact with a GPU, any attempt to perform arbitrary computations on a GPU would still be subject to the constraints of programming within a graphics API. Because of this, researchers explored general-purpose computation through graphics APIs by trying to make their problems appear to the GPU to be traditional rendering.
Essentially, the GPUs of the early 2000s were designed to produce a color for every pixel on the screen using programmable arithmetic units known as pixel shaders. In general, a pixel shader uses its (x,y) position on the screen as well as some additional information to combine various inputs in computing a final color. The additional information could be input colors, texture coordinates, or other attributes that would be passed to the shader when it ran. But because the arithmetic being performed on the input colors and textures was completely controlled by the programmer, researchers observed that these input “colors” could actually be any data.
So if the inputs were actually numerical data signifying something other than color, programmers could then program the pixel shaders to perform arbitrary computations on this data. The results would be handed back to the GPU as the final pixel “color,” although the colors would simply be the result of whatever computations the programmer had instructed the GPU to perform on their inputs. This data could be read back by the researchers, and the GPU would never be the wiser. In essence, the GPU was being tricked into performing nonrendering tasks by making those tasks appear as if they were a standard rendering. This trickery was very clever but also very convoluted.
Because of the high arithmetic throughput of GPUs, initial results from these experiments promised a bright future for GPU computing. However, the programming model was still far too restrictive for any critical mass of developers to form. There were tight resource constraints, since programs could receive input data only from a handful of input colors and a handful of texture units. There were serious limitations on how and where the programmer could write results to memory, so algorithms requiring the ability to write to arbitrary locations in memory (scatter) could not run on a GPU. Moreover, it was nearly impossible to predict how your particular GPU would deal with floating-point data, if it handled floating-point data at all, so most scientific computations would be unable to use a GPU. Finally, when the program inevitably computed the incorrect results, failed to terminate, or simply hung the machine, there existed no reasonably good method to debug any code that was being executed on the GPU.
As if the limitations weren’t severe enough, anyone who still wanted to use a GPU to perform general-purpose computations would need to learn OpenGL or DirectX since these remained the only means by which one could interact with a GPU. Not only did this mean storing data in graphics textures and executing computations by calling OpenGL or DirectX functions, but it meant writing the computations themselves in special graphics-only programming languages known as shading languages. Asking researchers to both cope with severe resource and programming restrictions as well as to learn computer graphics and shading languages before attempting to harness the computing power of their GPU proved too large a hurdle for wide acceptance.
It would not be until five years after the release of the GeForce 3 series that GPU computing would be ready for prime time. In November 2006, NVIDIA unveiled the industry’s first DirectX 10 GPU, the GeForce 8800 GTX. The GeForce 8800 GTX was also the first GPU to be built with NVIDIA’s CUDA Architecture. This architecture included several new components designed strictly for GPU computing and aimed to alleviate many of the limitations that prevented previous graphics processors from being legitimately useful for general-purpose computation.
Unlike previous generations that partitioned computing resources into vertex and pixel shaders, the CUDA Architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. Because NVIDIA intended this new family of graphics processors to be used for general-purpose computing, these ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics. Furthermore, the execution units on the GPU were allowed arbitrary read and write access to memory as well as access to a software-managed cache known as shared memory. All of these features of the CUDA Architecture were added in order to create a GPU that would excel at computation in addition to performing well at traditional graphics tasks.
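To make these architectural features concrete, here is a hedged sketch (the kernel name and array size are our own, not from the text) of a CUDA C kernel exercising exactly what this generation enabled: arbitrary reads and writes to global memory plus the software-managed shared-memory cache:

```cuda
__global__ void reverse64(float *data) {
    __shared__ float cache[64];   // software-managed cache (shared memory)
    int t = threadIdx.x;

    cache[t] = data[t];           // arbitrary read from global memory
    __syncthreads();              // wait until all 64 threads have loaded

    data[t] = cache[63 - t];      // arbitrary (scattered) write back
}

// launched from the host as: reverse64<<<1, 64>>>(dev_data);
```

Nothing like this scattered write was expressible through the fixed pixel-shader output model described earlier.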
The effort by NVIDIA to provide consumers with a product for both computation and graphics could not stop at producing hardware incorporating the CUDA Architecture, though. Regardless of how many features NVIDIA added to its chips to facilitate computing, there continued to be no way to access these features without using OpenGL or DirectX. Not only would this have required users to continue to disguise their computations as graphics problems, but they would have needed to continue writing their computations in a graphics-oriented shading language such as OpenGL’s GLSL or Microsoft’s HLSL.
To reach the maximum number of developers possible, NVIDIA took industry-standard C and added a relatively small number of keywords in order to harness some of the special features of the CUDA Architecture. A few months after the launch of the GeForce 8800 GTX, NVIDIA made public a compiler for this language, CUDA C. And with that, CUDA C became the first language specifically designed by a GPU company to facilitate general-purpose computing on GPUs.
In addition to creating a language to write code for the GPU, NVIDIA also provides a specialized hardware driver to exploit the CUDA Architecture’s massive computational power. Users are no longer required to have any knowledge of the OpenGL or DirectX graphics programming interfaces, nor are they required to force their problem to look like a computer graphics task.
Since its debut in early 2007, a variety of industries and applications have enjoyed a great deal of success by choosing to build applications in CUDA C. These benefits often include orders-of-magnitude performance improvement over the previous state-of-the-art implementations. Furthermore, applications running on NVIDIA graphics processors enjoy superior performance per dollar and performance per watt compared to implementations built exclusively on traditional central processing technologies. The following represent just a few of the ways in which people have put CUDA C and the CUDA Architecture into successful use.
The number of people who have been affected by the tragedy of breast cancer has dramatically risen over the course of the past 20 years. Thanks in a large part to the tireless efforts of many, awareness and research into preventing and curing this terrible disease has similarly risen in recent years. Ultimately, every case of breast cancer should be caught early enough to prevent the ravaging side effects of radiation and chemotherapy, the permanent reminders left by surgery, and the deadly consequences in cases that fail to respond to treatment. As a result, researchers share a strong desire to find fast, accurate, and minimally invasive ways to identify the early signs of breast cancer.
The mammogram, one of the current best techniques for the early detection of breast cancer, has several significant limitations. Two or more images need to be taken, and the film needs to be developed and read by a skilled doctor to identify potential tumors. Additionally, this X-ray procedure carries with it all the risks of repeatedly radiating a patient’s chest. After careful study, doctors often require further, more specific imaging—and even biopsy—in an attempt to eliminate the possibility of cancer. These false positives incur expensive follow-up work and cause undue stress to the patient until final conclusions can be drawn.
Ultrasound imaging is safer than X-ray imaging, so doctors often use it in conjunction with mammography to assist in breast cancer care and diagnosis. But conventional breast ultrasound has its limitations as well. As a result, TechniScan Medical Systems was born. TechniScan has developed a promising, three-dimensional, ultrasound imaging method, but its solution had not been put into practice for a very simple reason: computation limitations. Simply put, converting the gathered ultrasound data into the three-dimensional imagery required computation considered prohibitively time-consuming and expensive for practical use.
The introduction of NVIDIA’s first GPU based on the CUDA Architecture along with its CUDA C programming language provided a platform on which TechniScan could convert the dreams of its founders into reality. As the name indicates, its Svara ultrasound imaging system uses ultrasonic waves to image the patient’s chest. The TechniScan Svara system relies on two NVIDIA Tesla C1060 processors in order to process the 35GB of data generated by a 15-minute scan. Thanks to the computational horsepower of the Tesla C1060, within 20 minutes the doctor can manipulate a highly detailed, three-dimensional image of the woman’s breast. TechniScan expects wide deployment of its Svara system starting in 2010.
For many years, the design of highly efficient rotors and blades remained a black art of sorts. The astonishingly complex movement of air and fluids around these devices cannot be effectively modeled by simple formulations, so accurate simulations prove far too computationally expensive to be realistic. Only the largest supercomputers in the world could hope to offer computational resources on par with the sophisticated numerical models required to develop and validate designs. Since few have access to such machines, innovation in the design of such machines continued to stagnate.
The University of Cambridge, in a great tradition started by Charles Babbage, is home to active research into advanced parallel computing. Dr. Graham Pullan and Dr. Tobias Brandvik of the “many-core group” correctly identified the potential in NVIDIA’s CUDA Architecture to accelerate computational fluid dynamics to unprecedented levels. Their initial investigations indicated that acceptable levels of performance could be delivered by GPU-powered, personal workstations. Later, the use of a small GPU cluster easily outperformed their much more costly supercomputers and further confirmed their suspicions that the capabilities of NVIDIA’s GPU matched extremely well with the problems they wanted to solve.
For the researchers at Cambridge, the massive performance gains offered by CUDA C represent more than a simple, incremental boost to their supercomputing resources. The availability of copious amounts of low-cost GPU computation empowered the Cambridge researchers to perform rapid experimentation. Receiving experimental results within seconds streamlined the feedback process on which researchers rely in order to arrive at breakthroughs. As a result, the use of GPU clusters has fundamentally transformed the way they approach their research. Nearly interactive simulation has unleashed new opportunities for innovation and creativity in a previously stifled field of research.
The increasing need for environmentally sound consumer goods has arisen as a natural consequence of the rapidly escalating industrialization of the global economy. Growing concerns over climate change, the spiraling prices of fuel, and the growing level of pollutants in our air and water have brought into sharp relief the collateral damage of such successful advances in industrial output. Detergents and cleaning agents have long been some of the most necessary yet potentially calamitous consumer products in regular use. As a result, many scientists have begun exploring methods for reducing the environmental impact of such detergents without reducing their efficacy. Gaining something for nothing can be a tricky proposition, however.
The key components to cleaning agents are known as surfactants. Surfactant molecules determine the cleaning capacity and texture of detergents and shampoos, but they are often implicated as the most environmentally devastating component of cleaning products. These molecules attach themselves to dirt and then mix with water such that the surfactants can be rinsed away along with the dirt. Traditionally, measuring the cleaning value of a new surfactant would require extensive laboratory testing involving numerous combinations of materials and impurities to be cleaned. This process, not surprisingly, can be very slow and expensive.
Temple University has been working with industry leader Procter & Gamble to use molecular simulation of surfactant interactions with dirt, water, and other materials. The introduction of computer simulations serves not just to accelerate a traditional lab approach, but it extends the breadth of testing to numerous variants of environmental conditions, far more than could be practically tested in the past. Temple researchers used the GPU-accelerated Highly Optimized Object-oriented Many-particle Dynamics (HOOMD) simulation software written by the Department of Energy’s Ames Laboratory. By splitting their simulation across two NVIDIA Tesla GPUs, they were able to achieve equivalent performance to the 128 CPU cores of the Cray XT3 and to the 1024 CPUs of an IBM BlueGene/L machine. By increasing the number of Tesla GPUs in their solution, they are already simulating surfactant interactions at 16 times the performance of previous platforms. Since NVIDIA’s CUDA has reduced the time to complete such comprehensive simulations from several weeks to a few hours, the years to come should offer a dramatic rise in products that have both increased effectiveness and reduced environmental impact.
The computing industry is at the precipice of a parallel computing revolution, and NVIDIA’s CUDA C has thus far been one of the most successful languages ever designed for parallel computing. Throughout the course of this book, we will help you learn how to write your own code in CUDA C. We will help you learn the special extensions to C and the application programming interfaces that NVIDIA has created in service of GPU computing. You are not expected to know OpenGL or DirectX, nor are you expected to have any background in computer graphics.
We will not be covering the basics of programming in C, so we do not recommend this book to people completely new to computer programming. Some familiarity with parallel programming might help, although we do not expect you to have done any parallel programming. Any terms or concepts related to parallel programming that you will need to understand will be explained in the text. In fact, there may be some occasions when you find that knowledge of traditional parallel programming will cause you to make assumptions about GPU computing that prove untrue. So in reality, a moderate amount of experience with C or C++ programming is the only prerequisite to making it through this book.
In the next chapter, we will help you set up your machine for GPU computing, ensuring that you have both the hardware and the software components necessary to get started. After that, you’ll be ready to get your hands dirty with CUDA C. If you already have some experience with CUDA C or you’re sure that your system has been properly set up to do development in CUDA C, you can skip to Chapter 3.
We hope that Chapter 1 has gotten you excited to get started learning CUDA C. Since this book intends to teach you the language through a series of coding examples, you’ll need a functioning development environment. Sure, you could stand on the sideline and watch, but we think you’ll have more fun and stay interested longer if you jump in and get some practical experience hacking CUDA C code as soon as possible. In this vein, this chapter will walk you through some of the hardware and software components you’ll need in order to get started. The good news is that you can obtain all of the software you’ll need for free, leaving you more money for whatever tickles your fancy.
Through the course of this chapter, you will accomplish the following:
• You will download all the software components required for this book.
• You will set up an environment in which you can build code written in CUDA C.
Before embarking on this journey, you will need to set up an environment in which you can develop using CUDA C. The prerequisites to developing code in CUDA C are as follows:
• A CUDA-enabled graphics processor
• An NVIDIA device driver
• A CUDA development toolkit
• A standard C compiler
To make this chapter as painless as possible, we’ll walk through each of these prerequisites now.
Fortunately, it should be easy to find yourself a graphics processor that has been built on the CUDA Architecture because every NVIDIA GPU since the 2006 release of the GeForce 8800 GTX has been CUDA-enabled. Since NVIDIA regularly releases new GPUs based on the CUDA Architecture, the following will undoubtedly be only a partial list of CUDA-enabled GPUs. Nevertheless, the GPUs are all CUDA-capable.
For a complete list, you should consult the NVIDIA website at www.nvidia.com/cuda, although it is safe to assume that all recent GPUs (GPUs from 2007 on) with more than 256MB of graphics memory can be used to develop and run code written with CUDA C.
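If you are unsure what your machine contains, a short program against the CUDA runtime API will enumerate the CUDA-capable devices present. This is only a sketch offered ahead of the material it anticipates; it assumes the CUDA Toolkit described later in this chapter is already installed, and it uses the standard `cudaGetDeviceCount` and `cudaGetDeviceProperties` calls:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Print the device name and its compute capability
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Compile it with `nvcc`; any device it reports can run the examples in this book.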
NVIDIA provides system software that allows your programs to communicate with the CUDA-enabled hardware. If you have installed your NVIDIA GPU properly, you likely already have this software installed on your machine. It never hurts to ensure you have the most recent drivers, so we recommend that you visit www.nvidia.com/cuda and click the Download Drivers link. Select the options that match the graphics card and operating system on which you plan to do development. After following the installation instructions for the platform of your choice, your system will be up-to-date with the latest NVIDIA system software.
If you have a CUDA-enabled GPU and NVIDIA’s device driver, you are ready to run compiled CUDA C code. This means that you can download CUDA-powered applications, and they will be able to successfully execute their code on your graphics processor. However, we assume that you want to do more than just run code because, otherwise, this book isn’t really necessary. If you want to develop code for NVIDIA GPUs using CUDA C, you will need additional software. But as promised earlier, none of it will cost you a penny.
You will learn these details in the next chapter, but since your CUDA C applications are going to be computing on two different processors, you are consequently going to need two compilers. One compiler will compile code for your GPU, and one will compile code for your CPU. NVIDIA provides the compiler for your GPU code. As with the NVIDIA device driver, you can download the CUDA Toolkit at http://developer.nvidia.com/object/gpucomputing.html. Click the CUDA Toolkit link to reach the download page shown in Figure 2.1.
Figure 2.1 The CUDA download page
You will again be asked to select your platform from among 32- and 64-bit versions of Windows XP, Windows Vista, Windows 7, Linux, and Mac OS. From the available downloads, you need to download the CUDA Toolkit in order to build the code examples contained in this book. Additionally, you are encouraged, although not required, to download the GPU Computing SDK code samples package, which contains dozens of helpful example programs. The GPU Computing SDK code samples will not be covered in this book, but they nicely complement the material we intend to cover, and as with learning any style of programming, the more examples, the better. You should also take note that although nearly all the code in this book will work on the Linux, Windows, and Mac OS platforms, we have targeted the applications toward Linux and Windows. If you are using Mac OS X, you will be living dangerously and using unsupported code examples.
As we mentioned, you will need a compiler for GPU code and a compiler for CPU code. If you downloaded and installed the CUDA Toolkit as suggested in the previous section, you have a compiler for GPU code. A compiler for CPU code is the only component that remains on our CUDA checklist, so let’s address that issue so we can get to the interesting stuff.
On Microsoft Windows platforms, including Windows XP, Windows Vista, Windows Server 2008, and Windows 7, we recommend using the Microsoft Visual Studio C compiler. NVIDIA currently supports both the Visual Studio 2005 and Visual Studio 2008 families of products. As Microsoft releases new versions, NVIDIA will likely add support for newer editions of Visual Studio while dropping support for older versions. Many C and C++ developers already have Visual Studio 2005 or Visual Studio 2008 installed on their machine, so if this applies to you, you can safely skip this subsection.
If you do not have access to a supported version of Visual Studio and aren’t ready to invest in a copy, Microsoft does provide free downloads of the Visual Studio 2008 Express edition on its website. Although typically unsuitable for commercial software development, the Visual Studio Express editions are an excellent way to get started developing CUDA C on Windows platforms without investing money in software licenses. So, head on over to www.microsoft.com/visualstudio if you’re in need of Visual Studio 2008!
Most Linux distributions typically ship with a version of the GNU C compiler (gcc) installed. As of CUDA 3.0, the following Linux distributions shipped with supported versions of gcc installed:
• Red Hat Enterprise Linux 4.8
• Red Hat Enterprise Linux 5.3
• OpenSUSE 11.1
• SUSE Linux Enterprise Desktop 11
• Ubuntu 9.04
• Fedora 10
If you’re a die-hard Linux user, you’re probably aware that many Linux software packages work on far more than just the “supported” platforms. The CUDA Toolkit is no exception, so even if your favorite distribution is not listed here, it may be worth trying it anyway. The distribution’s kernel, gcc, and glibc versions will in a large part determine whether the distribution is compatible.
If you want to develop on Mac OS X, you will need to ensure that your machine has at least version 10.5.7 of Mac OS X. This includes version 10.6, Mac OS X “Snow Leopard.” Furthermore, you will need to install gcc by downloading and installing Apple’s Xcode. This software is provided free to Apple Developer Connection (ADC) members and can be downloaded from http://developer.apple.com/tools/Xcode. The code in this book was developed on Linux and Windows platforms but should work without modification on Mac OS X systems.
If you have followed the steps in this chapter, you are ready to start developing code in CUDA C. Perhaps you have even played around with some of the NVIDIA GPU Computing SDK code samples you downloaded from NVIDIA’s website. If so, we applaud your willingness to tinker! If not, don’t worry. Everything you need is right here in this book. Either way, you’re probably ready to start writing your first program in CUDA C, so let’s get started.
If you read Chapter 1, we hope we have convinced you of both the immense computational power of graphics processors and that you are just the programmer to harness it. And if you continued through Chapter 2, you should have a functioning environment set up in order to compile and run the code you’ll be writing in CUDA C. If you skipped the first chapters, perhaps you’re just skimming for code samples, perhaps you randomly opened to this page while browsing at a bookstore, or maybe you’re just dying to get started; that’s OK, too (we won’t tell). Either way, you’re ready to get started with the first code examples, so let’s go.
Through the course of this chapter, you will accomplish the following:
• You will write your first lines of code in CUDA C.
• You will learn the difference between code written for the host and code written for a device.
• You will learn how to run device code from the host.
• You will learn about the ways device memory can be used on CUDA-capable devices.
• 您将学习如何查询系统以获取
有关支持CUDA 的设备的信息。
• You will learn how to query your system for information on its CUDA-capable devices.
Since we intend to learn CUDA C by example, let’s take a look at our first example of CUDA C. In accordance with the laws governing written works of computer programming, we begin by examining a “Hello, World!” example.
At this point, no doubt you’re wondering whether this book is a scam. Is this just C? Does CUDA C even exist? The answers to these questions are both in the affirmative; this book is not an elaborate ruse. This simple “Hello, World!” example is meant to illustrate that, at its most basic, there is no difference between CUDA C and the standard C to which you have grown accustomed.
The simplicity of this example stems from the fact that it runs entirely on the host. This will be one of the important distinctions made in this book; we refer to the CPU and the system’s memory as the host and refer to the GPU and its memory as the device. This example resembles almost all the code you have ever written because it simply ignores any computing devices outside the host.
To remedy that sinking feeling that you’ve invested in nothing more than an expensive collection of trivialities, we will gradually build upon this simple example. Let’s look at something that uses the GPU (a device) to execute code. A function that executes on the device is typically called a kernel.
Now we will build upon our example with some code that should look more foreign than our plain-vanilla “Hello, World!” program.
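The listing in question, reconstructed here as a sketch (it requires NVIDIA's `nvcc` compiler to build), looks like this:

```cuda
#include <stdio.h>

__global__ void kernel( void ) {
    // an empty function that will be compiled to run on the device
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}
```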
This program makes two notable additions to the original “Hello, World!” example:
• An empty function named kernel() qualified with __global__
• A call to the empty function, embellished with <<<1,1>>>
As we saw in the previous section, code is compiled by your system’s standard C compiler by default. For example, GNU gcc might compile your host code on Linux operating systems, while Microsoft Visual C compiles it on Windows systems. The NVIDIA tools simply feed this host compiler your code, and everything behaves as it would in a world without CUDA.
Now we see that CUDA C adds the __global__ qualifier to standard C. This mechanism alerts the compiler that a function should be compiled to run on a device instead of the host. In this simple example, nvcc gives the function kernel() to the compiler that handles device code, and it feeds main() to the host compiler as it did in the previous example.
So, what is the mysterious call to kernel(), and why must we vandalize our standard C with angle brackets and a numeric tuple? Brace yourself, because this is where the magic happens.
We have seen that CUDA C needed a linguistic method for marking a function as device code. There is nothing special about this; it is shorthand to send host code to one compiler and device code to another compiler. The trick is actually in calling the device code from the host code. One of the benefits of CUDA C is that it provides this language integration so that device function calls look very much like host function calls. Later we will discuss what actually happens behind the scenes, but suffice to say that the CUDA compiler and runtime take care of the messy business of invoking device code from the host.
So, the mysterious-looking call invokes device code, but why the angle brackets and numbers? The angle brackets denote arguments we plan to pass to the runtime system. These are not arguments to the device code but are parameters that will influence how the runtime will launch our device code. We will learn about these parameters to the runtime in the next chapter. Arguments to the device code itself get passed within the parentheses, just like any other function invocation.
We’ve promised the ability to pass parameters to our kernel, and the time has come for us to make good on that promise. Consider the following enhancement to our “Hello, World!” application:
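A reconstruction of that enhanced program follows; `HANDLE_ERROR()` comes from `book.h`, the support code discussed later in this section, and the exact listing in your copy of the sources may differ slightly:

```cuda
#include <stdio.h>
#include "book.h"   // the book's support code, providing HANDLE_ERROR()

__global__ void add( int a, int b, int *c ) {
    *c = a + b;   // runs on the device; c points to device memory
}

int main( void ) {
    int c;
    int *dev_c;

    // allocate an int on the device to hold the result
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    // parameters pass through the parentheses, like any C call
    add<<<1,1>>>( 2, 7, dev_c );

    // copy the result back to the host so we can print it
    HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );
    return 0;
}
```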
You will notice a handful of new lines here, but these changes introduce only two concepts:
• We can pass parameters to a kernel as we would with any C function.
• We need to allocate memory to do anything useful on a device, such as return values to the host.
There is nothing special about passing parameters to a kernel. The angle-bracket syntax notwithstanding, a kernel call looks and acts exactly like any function call in standard C. The runtime system takes care of any complexity introduced by the fact that these parameters need to get from the host to the device.
The more interesting addition is the allocation of memory using cudaMalloc(). This call behaves very similarly to the standard C call malloc(), but it tells the CUDA runtime to allocate the memory on the device. The first argument is a pointer to the pointer you want to hold the address of the newly allocated memory, and the second parameter is the size of the allocation you want to make. Besides the fact that your allocated memory pointer is not the function’s return value, this is identical behavior to malloc(), right down to the void* return type. The HANDLE_ERROR() that surrounds these calls is a utility macro that we have provided as part of this book’s support code. It simply detects that the call has returned an error, prints the associated error message, and exits the application with an EXIT_FAILURE code. Although you are free to use this code in your own applications, it is highly likely that this error-handling code will be insufficient in production code.
This raises a subtle but important point. Much of the simplicity and power of CUDA C derives from the ability to blur the line between host and device code. However, it is the responsibility of the programmer not to dereference the pointer returned by cudaMalloc() from code that executes on the host. Host code may pass this pointer around, perform arithmetic on it, or even cast it to a different type. But you cannot use it to read or write from memory.
Unfortunately, the compiler cannot protect you from this mistake, either. It will be perfectly happy to allow dereferences of device pointers in your host code because it looks like any other pointer in the application. We can summarize the restrictions on the usage of device pointers as follows:
You can pass pointers allocated with cudaMalloc() to functions that execute on the device.
You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.
You can pass pointers allocated with cudaMalloc() to functions that execute on the host.
You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.
If you’ve been reading carefully, you might have anticipated the next lesson: We can’t use standard C’s free() function to release memory we’ve allocated with cudaMalloc(). To free memory we’ve allocated with cudaMalloc(), we need to use a call to cudaFree(), which behaves exactly like free() does.
We’ve seen how to use the host to allocate and free memory on the device, but we’ve also made it painfully clear that you cannot modify this memory from the host. The remaining two lines of the sample program illustrate two of the most common methods for accessing device memory—by using device pointers from within device code and by using calls to cudaMemcpy().
We use pointers from within device code exactly the same way we use them in standard C that runs on the host. The statement *c = a + b is as simple as it looks. It adds the parameters a and b together and stores the result in the memory pointed to by c. We hope this is almost too easy to even be interesting.
We listed the ways in which we can and cannot use device pointers from within device and host code. These caveats translate exactly as one might imagine when considering host pointers. Although we are free to pass host pointers around in device code, we run into trouble when we attempt to use a host pointer to access memory from within device code. To summarize, host pointers can access memory from host code, and device pointers can access memory from device code.
As promised, we can also access memory on a device through calls to cudaMemcpy() from host code. These calls behave exactly like standard C memcpy() with an additional parameter to specify which of the source and destination pointers point to device memory. In the example, notice that the last parameter to cudaMemcpy() is cudaMemcpyDeviceToHost, instructing the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.
Unsurprisingly, cudaMemcpyHostToDevice would indicate the opposite situation, where the source data is on the host and the destination is an address on the device. Finally, we can even specify that both pointers are on the device by passing cudaMemcpyDeviceToDevice. If the source and destination pointers are both on the host, we would simply use standard C’s memcpy() routine to copy between them.
Since we would like to be allocating memory and executing code on our device, it would be useful if our program had a way of knowing how much memory and what types of capabilities the device had. Furthermore, it is relatively common for people to have more than one CUDA-capable device per computer. In situations like this, we will definitely want a way to determine which processor is which.
For example, many motherboards ship with integrated NVIDIA graphics processors. When a manufacturer or user adds a discrete graphics processor to this computer, it then possesses two CUDA-capable processors. Some NVIDIA products, like the GeForce GTX 295, ship with two GPUs on a single card. Computers that contain products such as this will also show two CUDA-capable processors.
Before we get too deep into writing device code, we would love to have a mechanism for determining which devices (if any) are present and what capabilities each device supports. Fortunately, there is a very easy interface to determine this information. First, we will want to know how many devices in the system were built on the CUDA Architecture. These devices will be capable of executing kernels written in CUDA C. To get the count of CUDA devices, we call cudaGetDeviceCount(). Needless to say, we anticipate receiving an award for Most Creative Function Name.
After calling cudaGetDeviceCount(), we can then iterate through the devices and query relevant information about each. The CUDA runtime returns us these properties in a structure of type cudaDeviceProp. What kind of properties can we retrieve? As of CUDA 3.0, the cudaDeviceProp structure contains the following:
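The listing that originally followed is missing here; an abridged sketch of the structure, reconstructed from the CUDA 3.0 headers, is shown below (consult Table 3.1 and the reference manual for the authoritative field list):

```cuda
struct cudaDeviceProp {
    char   name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int    regsPerBlock;
    int    warpSize;
    size_t memPitch;
    int    maxThreadsPerBlock;
    int    maxThreadsDim[3];
    int    maxGridSize[3];
    size_t totalConstMem;
    int    major;
    int    minor;
    int    clockRate;
    size_t textureAlignment;
    int    deviceOverlap;
    int    multiProcessorCount;
    int    kernelExecTimeoutEnabled;
    int    integrated;
    int    canMapHostMemory;
    int    computeMode;
    /* ...plus texture-limit and other fields; see the CUDA
       Reference Manual for the complete declaration */
};
```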
Some of these are self-explanatory; others bear some additional description (see Table 3.1).
Table 3.1 CUDA Device Properties
We’d like to avoid going too far, too fast down our rabbit hole, so we will not go into extensive detail about these properties now. In fact, the previous list is missing some important details about some of these properties, so you will want to consult the NVIDIA CUDA Reference Manual for more information. When you move on to write your own applications, these properties will prove extremely useful. However, for now we will simply show how to query each device and report the properties of each. So far, our device query looks something like this:
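A sketch of that query, again assuming the `HANDLE_ERROR()` macro from the book's support code:

```cuda
// Count the CUDA devices, then query each one's properties.
cudaDeviceProp prop;
int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
for (int i = 0; i < count; i++) {
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
    // Do something with our device's properties...
}
```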
Now that we know each of the fields available to us, we can expand on the ambiguous “Do something...” section and implement something marginally less trivial:
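For instance, the loop body can report a few of the fields for each device (the original listing prints many more; this is an abridged sketch):

```cuda
cudaDeviceProp prop;
int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
for (int i = 0; i < count; i++) {
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
    printf( "   --- General Information for device %d ---\n", i );
    printf( "Name:  %s\n", prop.name );
    printf( "Compute capability:  %d.%d\n", prop.major, prop.minor );
    printf( "Clock rate:  %d\n", prop.clockRate );
    printf( "Total global mem:  %ld\n", (long)prop.totalGlobalMem );
    printf( "Multiprocessor count:  %d\n", prop.multiProcessorCount );
}
```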
Other than writing an application that handily prints every detail of every CUDA-capable card, why might we be interested in the properties of each device in our system? Since we as software developers want everyone to think our software is fast, we might be interested in choosing the GPU with the most multiprocessors on which to run our code. Or if the kernel needs close interaction with the CPU, we might be interested in running our code on the integrated GPU that shares system memory with the CPU. These are both properties we can query with cudaGetDeviceProperties().
Suppose that we are writing an application that depends on having double-precision floating-point support. After a quick consultation with Appendix A of the NVIDIA CUDA Programming Guide, we know that cards that have compute capability 1.3 or higher support double-precision floating-point math. So to successfully run the double-precision application that we’ve written, we need to find at least one device of compute capability 1.3 or higher.
Based on what we have seen with cudaGetDeviceCount() and cudaGetDeviceProperties(), we could iterate through each device and look for one that either has a major version greater than 1 or has a major version of 1 and minor version greater than or equal to 3. But since this relatively common procedure is also relatively annoying to perform, the CUDA runtime offers us an automated way to do this. We first fill a cudaDeviceProp structure with the properties we need our device to have.
After filling a cudaDeviceProp structure, we pass it to cudaChooseDevice() to have the CUDA runtime find a device that satisfies this constraint. The call to cudaChooseDevice() returns a device ID that we can then pass to cudaSetDevice(). From this point forward, all device operations will take place on the device we found in cudaChooseDevice().
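The flow just described can be sketched as follows (assuming `HANDLE_ERROR()` from the book's support code):

```cuda
cudaDeviceProp prop;
int dev;

HANDLE_ERROR( cudaGetDevice( &dev ) );
printf( "ID of current CUDA device:  %d\n", dev );

// Fill the structure with only the properties we require:
// compute capability 1.3 or higher for double precision.
memset( &prop, 0, sizeof( cudaDeviceProp ) );
prop.major = 1;
prop.minor = 3;

HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );
printf( "ID of CUDA device closest to revision 1.3:  %d\n", dev );
HANDLE_ERROR( cudaSetDevice( dev ) );
```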
Systems with multiple GPUs are becoming more and more common. For example, many of NVIDIA’s motherboard chipsets contain integrated, CUDA-capable GPUs. When a discrete GPU is added to one of these systems, you suddenly have a multi-GPU platform. Moreover, NVIDIA’s SLI technology allows multiple discrete GPUs to be installed side by side. In either of these cases, your application may have a preference of one GPU over another. If your application depends on certain features of the GPU or depends on having the fastest GPU in the system, you should familiarize yourself with this API because there is no guarantee that the CUDA runtime will choose the best or most appropriate GPU for your application.
We’ve finally gotten our hands dirty writing CUDA C, and ideally it has been less painful than you might have suspected. Fundamentally, CUDA C is standard C with some ornamentation to allow us to specify which code should run on the device and which should run on the host. By adding the keyword __global__ before a function, we indicated to the compiler that we intend to run the function on the GPU. To use the GPU’s dedicated memory, we also learned a CUDA API similar to C’s malloc(), memcpy(), and free() APIs. The CUDA versions of these functions, cudaMalloc(), cudaMemcpy(), and cudaFree(), allow us to allocate device memory, copy data between the device and host, and free the device memory when we’ve finished with it.
As we progress through this book, we will see more interesting examples of how we can effectively use the device as a massively parallel coprocessor. For now, you should know how easy it is to get started with CUDA C, and in the next chapter we will see how easy it is to execute parallel code on the GPU.
In the previous chapter, we saw how simple it can be to write code that executes on the GPU. We have even gone so far as to learn how to add two numbers together, albeit just the numbers 2 and 7. Admittedly, that example was not immensely impressive, nor was it incredibly interesting. But we hope you are convinced that it is easy to get started with CUDA C and you’re excited to learn more. Much of the promise of GPU computing lies in exploiting the massively parallel structure of many problems. In this vein, we intend to spend this chapter examining how to execute parallel code on the GPU using CUDA C.
Through the course of this chapter, you will accomplish the following:
• You will learn one of the fundamental ways CUDA exposes its parallelism.
• You will write your first parallel code with CUDA C.
Previously, we saw how easy it was to get a standard C function to start running on a device. By adding the __global__ qualifier to the function and by calling it using a special angle bracket syntax, we executed the function on our GPU. Although this was extremely simple, it was also extremely inefficient because NVIDIA’s hardware engineering minions have optimized their graphics processors to perform hundreds of computations in parallel. However, thus far we have only ever launched a kernel that runs serially on the GPU. In this chapter, we see how straightforward it is to launch a device kernel that performs its computations in parallel.
We will contrive a simple example to illustrate threads and how we use them to code with CUDA C. Imagine having two lists of numbers where we want to sum corresponding elements of each list and store the result in a third list. Figure 4.1 shows this process. If you have any background in linear algebra, you will recognize this operation as summing two vectors.
Figure 4.1 Summing two vectors
First we’ll look at one way this addition can be accomplished with traditional C code:
Most of this example bears almost no explanation, but we will briefly look at the add() function to explain why we overly complicated it.
We compute the sum within a while loop where the index tid ranges from 0 to N-1. We add corresponding elements of a[] and b[], placing the result in the corresponding element of c[]. One would typically code this in a slightly simpler manner, like so:
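That simpler form would look like this:

```c
#define N 10

/* The conventional single-loop form of the vector sum. */
void add( int *a, int *b, int *c ) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```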
Our slightly more convoluted method was intended to suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, one could change the increment to 2 and have one core initialize the loop with tid = 0 and another with tid = 1. The first core would add the even-indexed elements, and the second core would add the odd-indexed elements. This amounts to executing the following code on each of the two CPU cores:
Of course, doing this on a CPU would require considerably more code than we have included in this example. You would need to provide a reasonable amount of infrastructure to create the worker threads that execute the function add() as well as make the assumption that each thread would execute in parallel, a scheduling assumption that is unfortunately not always true.
We can accomplish the same addition very similarly on a GPU by writing add() as a device function. This should look similar to code you saw in the previous chapter. But before we look at the device code, we present main(). Although the GPU implementation of main() is different from the corresponding CPU version, nothing here should look new:
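A reconstructed sketch of that `main()` follows; `add()` is the `__global__` kernel from the book's listing, and `HANDLE_ERROR()` is again the support-code macro:

```cuda
#define N 10

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i = 0; i < N; i++)
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return 0;
}
```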
You will notice some common patterns that we employ again:
• We allocate three arrays on the device using calls to cudaMalloc(): two arrays, dev_a and dev_b, to hold inputs, and one array, dev_c, to hold the result.
• Because we are environmentally conscientious coders, we clean up after ourselves with cudaFree().
• Using cudaMemcpy(), we copy the input data to the device with the parameter cudaMemcpyHostToDevice and copy the result data back to the host with cudaMemcpyDeviceToHost.
• We execute the device code in add() from the host code in main() using the triple angle bracket syntax.
As an aside, you may be wondering why we fill the input arrays on the CPU. There is no reason in particular why we need to do this. In fact, the performance of this step would be faster if we filled the arrays on the GPU. But we intend to show how a particular operation, namely, the addition of two vectors, can be implemented on a graphics processor. As a result, we ask you to imagine that this is but one step of a larger application where the input arrays a[] and b[] have been generated by some other algorithm or loaded from the hard drive by the user. In summary, it will suffice to pretend that this data appeared out of nowhere and now we need to do something with it.
Moving on, our add() routine looks similar to its corresponding CPU implementation:
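A sketch of that kernel (the `N` macro is shared with `main()`):

```cuda
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;   // handle the data at this block's index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
```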
Again we see a common pattern with the function add():
• We have written a function called add() that executes on the device. We accomplished this by taking C code and adding a __global__ qualifier to the function name.
So far, there is nothing new in this example except it can do more than add 2 and 7. However, there are two noteworthy components of this example: The parameters within the triple angle brackets and the code contained in the kernel itself both introduce new concepts.
Up to this point, we have always seen kernels launched in the following form:
kernel<<<1,1>>>( param1, param2, ... );
But in this example we are launching with a number in the angle brackets that is not 1:
add<<<N,1>>>( dev_a, dev_b, dev_c );
What gives?
Recall that we left those two numbers in the angle brackets unexplained; we stated vaguely that they were parameters to the runtime that describe how to launch the kernel. Well, the first number in those parameters represents the number of parallel blocks in which we would like the device to execute our kernel. In this case, we’re passing the value N for this parameter.
For example, if we launch with kernel<<<2,1>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel. We call each of these parallel invocations a block. With kernel<<<256,1>>>(), you would get 256 blocks running on the GPU. Parallel programming has never been easier.
But this raises an excellent question: The GPU runs N copies of our kernel code, but how can we tell from within the code which block is currently running? This question brings us to the second new feature of the example, the kernel code itself. Specifically, it brings us to the variable blockIdx.x:
At first glance, it looks like this variable should cause a syntax error at compile time since we use it to assign the value of tid, but we have never defined it. However, there is no need to define the variable blockIdx; this is one of the built-in variables that the CUDA runtime defines for us. Furthermore, we use this variable for exactly what it sounds like it means. It contains the value of the block index for whichever block is currently running the device code.
Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don’t worry if you aren’t familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won’t be offended.
When we launched the kernel, we specified N as the number of parallel blocks. We call the collection of parallel blocks a grid. This specifies to the runtime system that we want a one-dimensional grid of N blocks (scalar values are interpreted as one-dimensional). These threads will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1. So, imagine four blocks, all running through the same copy of the device code but having different values for the variable blockIdx.x. This is what the actual code being executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x:
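Shown schematically, the body of `add()` in the first two of those four blocks effectively becomes:

```cuda
// What block 0 effectively executes:
int tid = 0;                    // blockIdx.x == 0
if (tid < N) c[tid] = a[tid] + b[tid];

// What block 1 effectively executes:
int tid = 1;                    // blockIdx.x == 1
if (tid < N) c[tid] = a[tid] + b[tid];

// ...and likewise for blocks 2 and 3, with tid = 2 and tid = 3.
```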
If you recall the CPU-based example with which we began, you will recall that we needed to walk through indices from 0 to N-1 in order to sum the two vectors. Since the runtime system is already launching a kernel where each block will have one of these indices, nearly all of this work has already been done for us. Because we’re something of a lazy lot, this is a good thing. It affords us more time to blog, probably about how lazy we are.
The last remaining question to be answered is, why do we check whether tid is less than N? It should always be less than N, since we’ve specifically launched our kernel such that this assumption holds. But our desire to be lazy also makes us paranoid about someone breaking an assumption we’ve made in our code. Breaking code assumptions means broken code. This means bug reports, late nights tracking down bad behavior, and generally lots of activities that stand between us and our blog. If we didn’t check that tid is less than N and subsequently fetched memory that wasn’t ours, this would be bad. In fact, it could possibly kill the execution of your kernel, since GPUs have sophisticated memory management units that kill processes that seem to be violating memory rules.
If you encounter problems like the ones just mentioned, one of the HANDLE_ERROR() macros that we’ve sprinkled so liberally throughout the code will detect and alert you to the situation. As with traditional C programming, the lesson here is that functions return error codes for a reason. Although it is always tempting to ignore these error codes, we would love to save you the hours of pain through which we have suffered by urging that you check the results of every operation that can fail. As is often the case, the presence of these errors will not prevent you from continuing the execution of your application, but they will most certainly cause all manner of unpredictable and unsavory side effects downstream.
At this point, you’re running code in parallel on the GPU. Perhaps you had heard this was tricky or that you had to understand computer graphics to do general-purpose programming on a graphics processor. We hope you are starting to see how CUDA C makes it much easier to get started writing parallel code on a GPU. We used the example only to sum vectors of length 10. If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: No dimension of your launch of blocks may exceed 65,535. This is simply a hardware-imposed limit, so you will start to see failures if you attempt launches with more blocks than this. In the next chapter, we will see how to work within this limitation.
We don’t mean to imply that adding vectors is anything less than fun, but the following example will satisfy those looking for some flashy examples of parallel CUDA C.
The following example will demonstrate code to draw slices of the Julia Set. For the uninitiated, the Julia Set is the boundary of a certain class of functions over complex numbers. Undoubtedly, this sounds even less fun than vector addition and matrix multiplication. However, for almost all values of the functions’ parameters, this boundary forms a fractal, one of the most interesting and beautiful curiosities of mathematics.
The calculations involved in generating such a set are quite simple. At its heart, the Julia Set evaluates a simple iterative equation for points in the complex plane. A point is not in the set if the process of iterating the equation diverges for that point. That is, if the sequence of values produced by iterating the equation grows toward infinity, a point is considered outside the set. Conversely, if the values taken by the equation remain bounded, the point is in the set.
Computationally, the iterative equation in question is remarkably simple, as shown in Equation 4.1.
Computing an iteration of Equation 4.1 would therefore involve squaring the current value and adding a constant to get the next value of the equation.
We will examine a source listing now that will compute and visualize the Julia Set. Since this is a more complicated program than we have studied so far, we will split it into pieces here. Later in the chapter, you will see the entire source listing.
Our main routine is remarkably simple. It creates the appropriate size bitmap image using a utility library provided. Next, it passes a pointer to the bitmap data to the kernel function.
The computation kernel does nothing more than iterate through all points we care to render, calling julia() on each to determine membership in the Julia Set. The function julia() will return 1 if the point is in the set and 0 if it is not in the set. We set the point’s color to be red if julia() returns 1 and black if it returns 0. These colors are arbitrary, and you should feel free to choose a color scheme that matches your personal aesthetics.
This function is the meat of the example. We begin by translating our pixel coordinate to a coordinate in complex space. To center the complex plane at the image center, we shift by DIM/2. Then, to ensure that the image spans the range of -1.0 to 1.0, we scale the image coordinate by DIM/2. Thus, given an image point at (x,y), we get a point in complex space at ((DIM/2 - x)/(DIM/2), (DIM/2 - y)/(DIM/2)).
Then, to potentially zoom in or out, we introduce a scale factor. Currently, the scale is hard-coded to be 1.5, but you should tweak this parameter to zoom in or out. If you are feeling really ambitious, you could make this a command-line parameter.
After obtaining the point in complex space, we then need to determine whether the point is in or out of the Julia Set. If you recall the previous section, we do this by computing the values of the iterative equation Z(n+1) = Z(n)^2 + C. Since C is some arbitrary complex-valued constant, we have chosen -0.8 + 0.156i because it happens to yield an interesting picture. You should play with this constant if you want to see other versions of the Julia Set.
In the example, we compute 200 iterations of this function. After each iteration, we check whether the magnitude of the result exceeds some threshold (1,000 for our purposes). If so, the equation is diverging, and we can return 0 to indicate that the point is not in the set. On the other hand, if we finish all 200 iterations and the magnitude is still bounded under 1,000, we assume that the point is in the set, and we return 1 to the caller, kernel().
Since all the computations are being performed on complex numbers, we define a generic structure to store complex numbers.
The structure represents complex numbers with two data elements: a single-precision real component r and a single-precision imaginary component i. The structure defines addition and multiplication operators that combine complex numbers as expected. (If you are completely unfamiliar with complex numbers, you can get a quick primer online.) Finally, we define a method that returns the magnitude of the complex number.
The device implementation is remarkably similar to the CPU version, continuing a trend you may have noticed.
This version of main() looks much more complicated than the CPU version, but the flow is actually identical. Like with the CPU version, we create a DIM x DIM bitmap image using our utility library. But because we will be doing computation on a GPU, we also declare a pointer called dev_bitmap to hold a copy of the data on the device. And to hold data, we need to allocate memory using cudaMalloc().
We then run our kernel() function exactly like in the CPU version, although now it is a __global__ function, meaning it will run on the GPU. As with the CPU example, we pass kernel() the pointer we allocated in the previous line to store the results. The only difference is that the memory resides on the GPU now, not on the host system.
The most significant difference is that we specify the number of parallel blocks on which to execute the function kernel(). Because each point can be computed independently of every other point, we simply specify one copy of the function for each point we want to compute. We mentioned that for some problem domains, it helps to use two-dimensional indexing. Unsurprisingly, computing function values over a two-dimensional domain such as the complex plane is one of these problems. So, we specify a two-dimensional grid of blocks in this line:
dim3 grid(DIM,DIM);
The type dim3 is not a standard C type, lest you feared you had forgotten some key pieces of information. Rather, the CUDA runtime header files define some convenience types to encapsulate multidimensional tuples. The type dim3 represents a three-dimensional tuple that will be used to specify the size of our launch. But why do we use a three-dimensional value when we oh-so-clearly stated that our launch is a two-dimensional grid?
Frankly, we do this because a three-dimensional dim3 value is what the CUDA runtime expects. Although a three-dimensional launch grid is not currently supported, the CUDA runtime still expects a dim3 variable whose last component equals 1. When we initialize it with only two values, as we do in the statement dim3 grid(DIM,DIM), the CUDA runtime automatically fills the third dimension with the value 1, so everything here will work as expected. Although it’s possible that NVIDIA will support a three-dimensional grid in the future, for now we’ll just play nicely with the kernel launch API because when coders and APIs fight, the API always wins.
We then pass our dim3 variable grid to the CUDA runtime in this line:
kernel<<<grid,1>>>( dev_bitmap );
Finally, a consequence of the results residing on the device is that after executing kernel(), we have to copy the results back to the host. As we learned in previous chapters, we accomplish this with a call to cudaMemcpy(), specifying the direction cudaMemcpyDeviceToHost as the last argument.
One of the last wrinkles in the difference of implementation comes in the implementation of kernel().
First, we need kernel() to be declared as a __global__ function so it runs on the device but can be called from the host. Unlike the CPU version, we no longer need nested for() loops to generate the pixel indices that get passed to julia(). As with the vector addition example, the CUDA runtime generates these indices for us in the variable blockIdx. This works because we declared our grid of blocks to have the same dimensions as our image, so we get one block for each pair of integers (x,y) between (0,0) and (DIM-1, DIM-1).
Next, the only additional information we need is a linear offset into our output buffer, ptr. This gets computed using another built-in variable, gridDim. This variable is a constant across all blocks and simply holds the dimensions of the grid that was launched. In this example, it will always be the value (DIM, DIM). So, multiplying the row index by the grid width and adding the column index will give us a unique index into ptr that ranges from 0 to (DIM*DIM-1).
Finally, we examine the actual code that determines whether a point is in or out of the Julia Set. This code should look identical to the CPU version, continuing a trend we have seen in many examples now.
Again, we define a cuComplex structure that defines a method for storing a complex number with single-precision floating-point components. The structure also defines addition and multiplication operators as well as a function to return the magnitude of the complex value.
Notice that we use the same language constructs in CUDA C that we use in our CPU version. The one difference is the qualifier __device__, which indicates that this code will run on a GPU and not on the host. Recall that because these functions are declared as __device__ functions, they will be callable only from other __device__ functions or from __global__ functions.
Since we’ve interrupted the code with commentary so frequently, here is the entire source listing from start to finish:
When you run the application, you should see a visualization of the Julia Set. To convince you that it has earned the title “A Fun Example,” Figure 4.2 shows a screenshot taken from this application.
Figure 4.2 A screenshot from the GPU Julia Set application
Congratulations, you can now write, compile, and run massively parallel code on a graphics processor! You should go brag to your friends. And if they are still under the misconception that GPU computing is exotic and difficult to master, they will be most impressed. The ease with which you accomplished it will be our secret. If they’re people you trust with your secrets, suggest that they buy the book, too.
We have so far looked at how to instruct the CUDA runtime to execute multiple copies of our program in parallel on what we called blocks. We called the collection of blocks we launch on the GPU a grid. As the name might imply, a grid can be either a one- or two-dimensional collection of blocks. Each copy of the kernel can determine which block it is executing with the built-in variable blockIdx. Likewise, it can determine the size of the grid by using the built-in variable gridDim. Both of these built-in variables proved useful within our kernel to calculate the data index for which each block is responsible.
We have now written our first program using CUDA C as well as have seen how to write code that executes in parallel on a GPU. This is an excellent start! But arguably one of the most important components to parallel programming is the means by which the parallel processing elements cooperate on solving a problem. Rare are the problems where every processor can compute results and terminate execution without a passing thought as to what the other processors are doing. For even moderately sophisticated algorithms, we will need the parallel copies of our code to communicate and cooperate. So far, we have not seen any mechanisms for accomplishing this communication between sections of CUDA C code executing in parallel. Fortunately, there is a solution, one that we will begin to explore in this chapter.
Through the course of this chapter, you will accomplish the following:
• You will learn about what CUDA C calls threads.
• You will learn a mechanism for different threads to communicate with each other.
• You will learn a mechanism to synchronize the parallel execution of different threads.
In the previous chapter, we looked at how to launch parallel code on the GPU. We did this by instructing the CUDA runtime system on how many parallel copies of our kernel to launch. We call these parallel copies blocks.
The CUDA runtime allows these blocks to be split into threads. Recall that when we launched multiple parallel blocks, we changed the first argument in the angle brackets from 1 to the number of blocks we wanted to launch. For example, when we studied vector addition, we launched a block for each element in the vector of size N by calling this:
add<<<N,1>>>( dev_a, dev_b, dev_c );
Inside the angle brackets, the second parameter actually represents the number of threads per block we want the CUDA runtime to create on our behalf. To this point, we have only ever launched one thread per block. In the previous example, we launched the following:
N blocks x 1 thread/block = N parallel threads
So really, we could have launched N/2 blocks with two threads per block, N/4 blocks with four threads per block, and so on. Let’s revisit our vector addition example armed with this new information about the capabilities of CUDA C.
We endeavor to accomplish the same task as we did in the previous chapter. That is, we want to take two input vectors and store their sum in a third output vector. However, this time we will use threads instead of blocks to accomplish this.
You may be wondering, what is the advantage of using threads rather than blocks? Well, for now, there is no advantage worth discussing. But parallel threads within a block will have the ability to do things that parallel blocks cannot do. So for now, be patient and humor us while we walk through a parallel thread version of the parallel block example from the previous chapter.
We will start by addressing the two changes of note when moving from parallel blocks to parallel threads. Our kernel invocation will change from one that launches N blocks of one thread apiece:
add<<<N,1>>>( dev_a, dev_b, dev_c );
to a version that launches N threads, all within one block:
add<<<1,N>>>( dev_a, dev_b, dev_c );
The only other change arises in the method by which we index our data. Previously, within our kernel we indexed the input and output data by block index.
The punch line here should not be a surprise. Now that we have only a single block, we have to index the data by thread index.
These are the only two changes required to move from a parallel block implementation to a parallel thread implementation. For completeness, here is the entire source listing with the changed lines in bold:
Pretty simple stuff, right? In the next section, we’ll see one of the limitations of this thread-only approach. And of course, later we’ll see why we would even bother splitting blocks into other parallel components.
In the previous chapter, we noted that the hardware limits the number of blocks in a single launch to 65,535. Similarly, the hardware limits the number of threads per block with which we can launch a kernel. Specifically, this number cannot exceed the value specified by the maxThreadsPerBlock field of the device properties structure we looked at in Chapter 3. For many of the graphics processors currently available, this limit is 512 threads per block. So how would we use a thread-based approach to add two vectors of size greater than 512? We will have to use a combination of threads and blocks to accomplish this.
As before, this will require two changes: We will have to change the index computation within the kernel, and we will have to change the kernel launch itself.
Now that we have multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space.
This assignment uses a new built-in variable, blockDim. This variable is a constant for all blocks and stores the number of threads along each dimension of the block. Since we are using a one-dimensional block, we refer only to blockDim.x. If you recall, gridDim stored a similar value, but it stored the number of blocks along each dimension of the entire grid. Moreover, gridDim is two-dimensional, whereas blockDim is actually three-dimensional. That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks where each block is a three-dimensional array of threads. Yes, this is a lot of dimensions, and it is unlikely you will regularly need the five degrees of indexing freedom afforded you, but they are available if so desired.
Indexing the data in a linear array using the previous assignment actually is quite intuitive. If you disagree, it may help to think about your collection of blocks of threads spatially, similar to a two-dimensional array of pixels. We depict this arrangement in Figure 5.1.
Figure 5.1 A two-dimensional arrangement of a collection of blocks and threads
If the threads represent columns and the blocks represent rows, we can get a unique index by taking the product of the block index with the number of threads in each block and adding the thread index within the block. This is identical to the method we used to linearize the two-dimensional image index in the Julia Set example.
Here, DIM is the block dimension (measured in threads), y is the block index, and x is the thread index within the block. Hence, we arrive at the index: tid = threadIdx.x + blockIdx.x * blockDim.x.
The other change is to the kernel launch itself. We still need N parallel threads to launch, but we want them to launch across multiple blocks so we do not hit the 512-thread limitation imposed upon us. One solution is to arbitrarily set the block size to some fixed number of threads; for this example, let’s use 128 threads per block. Then we can just launch N/128 blocks to get our total of N threads running.
The wrinkle here is that N/128 is an integer division. This implies that if N were 127, N/128 would be zero, and we will not actually compute anything if we launch zero threads. In fact, we will launch too few threads whenever N is not an exact multiple of 128. This is bad. We actually want this division to round up.
There is a common trick to accomplish this in integer division without calling ceil(). We actually compute (N+127)/128 instead of N/128. Either you can take our word that this yields the smallest integer greater than or equal to N/128 (so launching that many 128-thread blocks always covers at least N threads), or you can take a moment now to convince yourself of this fact.
We have chosen 128 threads per block and therefore use the following kernel launch:
Because of our change to the division that ensures we launch enough threads, we will actually now launch too many threads when N is not an exact multiple of 128. But there is a simple remedy to this problem, and our kernel already takes care of it. We have to check whether a thread’s offset is actually between 0 and N before we use it to access our input and output arrays:
Thus, when our index overshoots the end of our array, as will always happen when we launch a nonmultiple of 128, we automatically refrain from performing the calculation. More important, we refrain from reading and writing memory off the end of our array.
We were not completely forthcoming when we first discussed launching parallel blocks on a GPU. In addition to the limitation on thread count, there is also a hardware limitation on the number of blocks (albeit much greater than the thread limitation). As we’ve mentioned previously, neither dimension of a grid of blocks may exceed 65,535.
So, this raises a problem with our current vector addition implementation. If we launch N/128 blocks to add our vectors, we will hit launch failures when our vectors exceed 65,535 * 128 = 8,388,480 elements. This seems like a large number, but with current memory capacities between 1GB and 4GB, the high-end graphics processors can hold orders of magnitude more data than vectors with 8 million elements.
Fortunately, the solution to this issue is extremely simple. We first make a change to our kernel.
This looks remarkably like our original version of vector addition! In fact, compare it to the following CPU implementation from the previous chapter:
Here we also used a while() loop to iterate through the data. Recall that we claimed that rather than incrementing the array index by 1, a multi-CPU or multicore version could increment by the number of processors we wanted to use. We will now use that same principle in the GPU version.
In the GPU implementation, we consider the number of parallel threads launched to be the number of processors. Although the actual GPU may have fewer (or more) processing units than this, we think of each thread as logically executing in parallel and then allow the hardware to schedule the actual execution. Decoupling the parallelization from the actual method of hardware execution is one of the burdens that CUDA C lifts off a software developer’s shoulders. This should come as a relief, considering current NVIDIA hardware can ship with anywhere between 8 and 480 arithmetic units per chip!
Now that we understand the principle behind this implementation, we just need to understand how we determine the initial index value for each parallel thread and how we determine the increment. We want each parallel thread to start on a different data index, so we just need to take our thread and block indexes and linearize them as we saw in the “GPU Sums of a Longer Vector” section. Each thread will start at an index given by the following:
After each thread finishes its work at the current index, we need to increment each of them by the total number of threads running in the grid. This is simply the number of threads per block multiplied by the number of blocks in the grid, or blockDim.x * gridDim.x. Hence, the increment step is as follows:
tid += blockDim.x * gridDim.x;
We are almost there! The only remaining piece is to fix the launch itself. If you remember, we took this detour because the launch add<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c ) will fail when (N+127)/128 is greater than 65,535. To ensure we never launch too many blocks, we will just fix the number of blocks to some reasonably small value. Since we like copying and pasting so much, we will use 128 blocks, each with 128 threads.
add<<<128,128>>>( dev_a, dev_b, dev_c );
You should feel free to adjust these values however you see fit, provided that your values remain within the limits we’ve discussed. Later in the book, we will discuss the potential performance implications of these choices, but for now it suffices to choose 128 threads per block and 128 blocks. Now we can add vectors of arbitrary length, limited only by the amount of RAM we have on our GPU. Here is the entire source listing:
As with the previous chapter, we will reward your patience with vector addition by presenting a more fun example that demonstrates some of the techniques we’ve been using. We will again use our GPU computing power to generate pictures procedurally. But to make things even more interesting, this time we will animate them. But don’t worry, we’ve packaged all the unrelated animation code into helper functions so you won’t have to master any graphics or animation.
Most of the complexity of main() is hidden in the helper structure CPUAnimBitmap. You will notice that we again have a pattern of doing a cudaMalloc(), executing device code that uses the allocated memory, and then cleaning up with cudaFree(). This should be old hat to you by now.
In this example, we have slightly convoluted the means by which we accomplish the middle step, “executing device code that uses the allocated memory.” We pass the anim_and_exit() method a function pointer to generate_frame(). This function will be called by the structure every time it wants to generate a new frame of the animation.
Although this function consists only of four lines, they all involve important CUDA C concepts. First, we declare two two-dimensional variables, blocks and threads. As our naming convention makes painfully obvious, the variable blocks represents the number of parallel blocks we will launch in our grid. The variable threads represents the number of threads we will launch per block. Because we are generating an image, we use two-dimensional indexing so that each thread will have a unique (x,y) index that we can easily put into correspondence with a pixel in the output image. We have chosen to use blocks that consist of a 16 x 16 array of threads. If the image has DIM x DIM pixels, we need to launch DIM/16 x DIM/16 blocks to get one thread per pixel. Figure 5.2 shows how this block and thread configuration would look in a (ridiculously) small, 48-pixel-wide, 32-pixel-high image.
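The two declarations use CUDA's `dim3` type; a sketch, with `DIM` being the image dimension as in the earlier examples:

```cuda
dim3 blocks(DIM/16, DIM/16);   // number of parallel blocks in the grid
dim3 threads(16, 16);          // a 16 x 16 array of threads per block
```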
Figure 5.2 A 2D hierarchy of blocks and threads that could be used to process a 48 x 32 pixel image using one thread per pixel
If you have done any multithreaded CPU programming, you may be wondering why we would launch so many threads. For example, to render a full high-definition animation at 1920 x 1080, this method would create more than 2 million threads. Although we routinely create and schedule this many threads on a GPU, one would not dream of creating this many threads on a CPU. Because CPU thread management and scheduling must be done in software, it simply cannot scale to the number of threads that a GPU can. Because we can simply create a thread for each data element we want to process, parallel programming on a GPU can be far simpler than on a CPU.
After declaring the variables that hold the dimensions of our launch, we simply launch the kernel that will compute our pixel values.
kernel<<< blocks, threads>>>( d->dev_bitmap, ticks );
The kernel will need two pieces of information that we pass as parameters. First, it needs a pointer to device memory that holds the output pixels. This is a global variable that had its memory allocated in main(). But the variable is “global” only for host code, so we need to pass it as a parameter to ensure that the CUDA runtime will make it available for our device code.
Second, our kernel will need to know the current animation time so it can generate the correct frame. The current time, ticks, is passed to the generate_frame() function from the infrastructure code in CPUAnimBitmap, so we can simply pass this on to our kernel.
And now, here’s the kernel code itself:
The first three lines are the most important ones in the kernel.
In these lines, each thread takes its index within its block as well as the index of its block within the grid, and it translates this into a unique (x,y) index within the image. So when the thread at index (3, 5) in block (12, 8) begins executing, it knows that there are 12 entire blocks to the left of it and 8 entire blocks above it. Within its block, the thread at (3, 5) has three threads to the left and five above it. Because there are 16 threads per block, this means the thread in question has the following:
3 threads + 12 blocks * 16 threads/block = 195 threads to the left of it
5 threads + 8 blocks * 16 threads/block = 133 threads above it
This computation is identical to the computation of x and y in the first two lines and is how we map the thread and block indices to image coordinates. Then we simply linearize these x and y values to get an offset into the output buffer. Again, this is identical to what we did in the “GPU Sums of a Longer Vector” and “GPU Sums of Arbitrarily Long Vectors” sections.
int offset = x + y * blockDim.x * gridDim.x;
Since we know which (x,y) pixel in the image the thread should compute and we know the time at which it needs to compute this value, we can compute any function of (x,y,t) and store this value in the output buffer. In this case, the function produces a time-varying sinusoidal “ripple.”
We recommend that you not get too hung up on the computation of grey. It’s essentially just a 2D function of time that makes a nice rippling effect when it’s animated. A screenshot of one frame should look something like Figure 5.3.
Figure 5.3 A screenshot from the GPU ripple example
So far, the motivation for splitting blocks into threads was simply one of working around hardware limitations to the number of blocks we can have in flight. This is fairly weak motivation, because this could easily be done behind the scenes by the CUDA runtime. Fortunately, there are other reasons one might want to split a block into threads.
CUDA C makes available a region of memory that we call shared memory. This region of memory brings along with it another extension to the C language akin to __device__ and __global__. As a programmer, you can modify your variable declarations with the CUDA C keyword __shared__ to make this variable resident in shared memory. But what’s the point?
We’re glad you asked. The CUDA C compiler treats variables in shared memory differently than typical variables. It creates a copy of the variable for each block that you launch on the GPU. Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared memory buffers reside physically on the GPU as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
The prospect of communication between threads should excite you. It excites us, too. But nothing in life is free, and interthread communication is no exception. If we expect to communicate between threads, we also need a mechanism for synchronizing between threads. For example, if thread A writes a value to shared memory and we want thread B to do something with this value, we can’t have thread B start its work until we know the write from thread A is complete. Without synchronization, we have created a race condition where the correctness of the execution results depends on the nondeterministic details of the hardware.
Let’s take a look at an example that uses these features.
Congratulations! We have graduated from vector addition and will now take a look at vector dot products (sometimes called an inner product). We will quickly review what a dot product is, just in case you are unfamiliar with vector mathematics (or it has been a few years). The computation consists of two steps. First, we multiply corresponding elements of the two input vectors. This is very similar to vector addition but utilizes multiplication instead of addition. However, instead of then storing these values to a third, output vector, we sum them all to produce a single scalar output.
For example, if we take the dot product of two four-element vectors, we would get Equation 5.1.
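Written out for two four-element vectors, the dot product is:

```latex
(x_1, x_2, x_3, x_4) \cdot (y_1, y_2, y_3, y_4) = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4
```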
Perhaps the algorithm we tend to use is becoming obvious. We can do the first step exactly how we did vector addition. Each thread multiplies a pair of corresponding entries, and then every thread moves on to its next pair. Because the result needs to be the sum of all these pairwise products, each thread keeps a running sum of the pairs it has added. Just like in the addition example, the threads increment their indices by the total number of threads to ensure we don’t miss any elements and don’t multiply a pair twice. Here is the first step of the dot product routine:
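A sketch of that first step, with `threadsPerBlock` and `N` as defined in the chapter's listing; the reduction step that completes the kernel follows later in the text:

```cuda
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];        // multiply corresponding entries
        tid += blockDim.x * gridDim.x;  // stride by the total thread count
    }

    // each thread stores its running sum into the shared buffer
    cache[cacheIndex] = temp;
    // ... reduction step follows ...
```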
As you can see, we have declared a buffer of shared memory named cache. This buffer will be used to store each thread’s running sum. Soon we will see why we do this, but for now we will simply examine the mechanics by which we accomplish it. It is trivial to declare a variable to reside in shared memory, and it is identical to the means by which you declare a variable as static or volatile in standard C:
__shared__ float cache[threadsPerBlock];
We declare the array of size threadsPerBlock so each thread in the block has a place to store its temporary result. Recall that when we have allocated memory globally, we allocated enough for every thread that runs the kernel, or threadsPerBlock times the total number of blocks. But since the compiler will create a copy of the shared variables for each block, we need to allocate only enough memory such that each thread in the block has an entry.
After allocating the shared memory, we compute our data indices much like we have in the past:
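Those indices are computed as follows:

```cuda
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
```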
The computation for the variable tid should look familiar by now; we are just combining the block and thread indices to get a global offset into our input arrays. The offset into our shared memory cache is simply our thread index. Again, we don’t need to incorporate our block index into this offset because each block has its own private copy of this shared memory.
Finally, we clear our shared memory buffer so that later we will be able to blindly sum the entire array without worrying whether a particular entry has valid data stored there:
It is possible that not every entry will be used if the size of the input vectors is not a multiple of the number of threads per block. In that case, the last block will have some threads that do nothing and therefore do not write values.
Each thread computes a running sum of the product of corresponding entries in a and b. After reaching the end of the array, each thread stores its temporary sum into the shared buffer.
At this point in the algorithm, we need to sum all the temporary values we’ve placed in the cache. To do this, we will need some of the threads to read the values that have been stored there. However, as we mentioned, this is a potentially dangerous operation. We need a method to guarantee that all of these writes to the shared array cache[] complete before anyone tries to read from this buffer. Fortunately, such a method exists:
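The method is a barrier across the block:

```cuda
__syncthreads();
```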
This call guarantees that every thread in the block has completed instructions prior to the __syncthreads() before the hardware will execute the next instruction on any thread. This is exactly what we need! We now know that when the first thread executes the first instruction after our __syncthreads(), every other thread in the block has also finished executing up to the __syncthreads().
Now that we have guaranteed that our temporary cache has been filled, we can sum the values in it. We call the general process of taking an input array and performing some computations that produce a smaller array of results a reduction. Reductions arise often in parallel computing, which leads to the desire to give them a name.
The naïve way to accomplish this reduction would be for one thread to iterate over the shared memory and calculate a running sum. This would take time proportional to the length of the array. However, since we have hundreds of threads available to do our work, we can do this reduction in parallel and take time proportional to the logarithm of the length of the array. At first, the following code will look convoluted; we'll break it down in a moment.
The general idea is that each thread will add two of the values in cache[] and store the result back to cache[]. Since each thread combines two entries into one, we complete this step with half as many entries as we started with. In the next step, we do the same thing on the remaining half. We continue in this fashion for log2(threadsPerBlock) steps until we have the sum of every entry in cache[]. For our example, we’re using 256 threads per block, so it takes 8 iterations of this process to reduce the 256 entries in cache[] to a single sum.
The code for this follows:
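A sketch of the reduction loop, with `cacheIndex` and `cache[]` as declared above:

```cuda
int i = blockDim.x / 2;
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
```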
Figure 5.4 One step of a summation reduction
For the first step, we start with i as half the number of threadsPerBlock. We only want the threads with indices less than this value to do any work, so we conditionally add two entries of cache[] if the thread’s index is less than i. We protect our addition within an if(cacheIndex < i) block. Each thread will take the entry at its index in cache[], add it to the entry at its index offset by i, and store this sum back to cache[].
Suppose there were eight entries in cache[] and, as a result, i had the value 4. One step of the reduction would look like Figure 5.4.
After we have completed a step, we have the same restriction we did after computing all the pairwise products. Before we can read the values we just stored in cache[], we need to ensure that every thread that needs to write to cache[] has already done so. The __syncthreads() after the assignment ensures this condition is met.
After termination of this while() loop, each block has but a single number remaining. This number is sitting in the first entry of cache[] and is the sum of every pairwise product the threads in that block computed. We then store this single value to global memory and end our kernel:
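The final store can be sketched as:

```cuda
if (cacheIndex == 0)
    c[blockIdx.x] = cache[0];
```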
Why do we do this global store only for the thread with cacheIndex == 0? Well, since there is only one number that needs writing to global memory, only a single thread needs to perform this operation. Conceivably, every thread could perform this write and the program would still work, but doing so would create an unnecessarily large amount of memory traffic to write a single value. For simplicity, we chose the thread with index 0, though you could conceivably have chosen any cacheIndex to write cache[0] to global memory. Finally, since each block will write exactly one value to the global array c[], we can simply index it by blockIdx.
We are left with an array c[], each entry of which contains the sum produced by one of the parallel blocks. The last step of the dot product is to sum the entries of c[]. Even though the dot product is not fully computed, we exit the kernel and return control to the host at this point. But why do we return to the host before the computation is complete?
Previously, we referred to an operation like a dot product as a reduction. Roughly speaking, this is because we produce fewer output data elements than we input. In the case of a dot product, we always produce exactly one output, regardless of the size of our input. It turns out that a massively parallel machine like a GPU tends to waste its resources when performing the last steps of a reduction, since the size of the data set is so small at that point; it is hard to utilize 480 arithmetic units to add 32 numbers!
For this reason, we return control to the host and let the CPU finish the final step of the addition, summing the array c[]. In a larger application, the GPU would now be free to start another dot product or work on another large computation. However, in this example, we are done with the GPU.
In explaining this example, we broke with tradition and jumped right into the actual kernel computation. We hope you will have no trouble understanding the body of main() up to the kernel call, since it is overwhelmingly similar to what we have shown before.
To avoid you passing out from boredom, we will quickly summarize this code:
1. Allocate host and device memory for input and output arrays.
2. Fill input arrays a[] and b[], and then copy these to the device using cudaMemcpy().
3. Call our dot product kernel using some predetermined number of threads per block and blocks per grid.
Despite most of this being commonplace to you now, it is worth examining the computation for the number of blocks we launch. We discussed how the dot product is a reduction and how each block launched will compute a partial sum. The length of this list of partial sums should be something manageably small for the CPU yet large enough such that we have enough blocks in flight to keep even the fastest GPUs busy. We have chosen 32 blocks, although this is a case where you may notice better or worse performance for other choices, especially depending on the relative speeds of your CPU and GPU.
But what if we are given a very short list and 32 blocks of 256 threads apiece is too many? If we have N data elements, we need only N threads in order to compute our dot product. So in this case, we need the smallest multiple of threadsPerBlock that is greater than or equal to N. We have seen this once before when we were adding vectors. In this case, we get the smallest multiple of threadsPerBlock that is greater than or equal to N by computing (N+(threadsPerBlock-1)) / threadsPerBlock. As you may be able to tell, this is actually a fairly common trick in integer math, so it is worth digesting this even if you spend most of your time working outside the CUDA C realm.
Therefore, the number of blocks we launch should be either 32 or (N+(threadsPerBlock-1)) / threadsPerBlock, whichever value is smaller.
Now it should be clear how we arrive at the code in main(). After the kernel finishes, we still have to sum the result. But like the way we copy our input to the GPU before we launch a kernel, we need to copy our output back to the CPU before we continue working with it. So after the kernel finishes, we copy back the list of partial sums and complete the sum on the CPU.
Finally, we check our results and clean up the memory we’ve allocated on both the CPU and GPU. Checking the results is made easier because we’ve filled the inputs with predictable data. If you recall, a[] is filled with the integers from 0 to N-1 and b[] is just 2*a[].
Our dot product should be two times the sum of the squares of the integers from 0 to N-1. For the reader who loves discrete mathematics (and what’s not to love?!), it will be an amusing diversion to derive the closed-form solution for this summation. For those with less patience or interest, we present the closed-form here, as well as the rest of the body of main():
If you found all our explanatory interruptions bothersome, here is the entire source listing, sans commentary:
We quickly glossed over the second __syncthreads() in the dot product example. Now we will take a closer look at it as well as examining an attempt to improve it. If you recall, we needed the second __syncthreads() because we update our shared memory variable cache[] and need these updates to be visible to every thread on the next iteration through the loop.
Observe that we update our shared memory buffer cache[] only if cacheIndex is less than i. Since cacheIndex is really just threadIdx.x, this means that only some of the threads are updating entries in the shared memory cache. Since we are using __syncthreads only to ensure that these updates have taken place before proceeding, it stands to reason that we might see a speed improvement if we wait only for the threads that are actually writing to shared memory. We do this by moving the synchronization call inside the if() block:
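The (broken) attempt looks like this:

```cuda
int i = blockDim.x / 2;
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();   // moved inside the if() -- this is the bug
    }
    i /= 2;
}
```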
Although this was a valiant effort at optimization, it will not actually work. In fact, the situation is worse than that. This change to the kernel will actually cause the GPU to stop responding, forcing you to kill your program. But what could have gone so catastrophically wrong with such a seemingly innocuous change?
To answer this question, it helps to imagine every thread in the block marching through the code one line at a time. At each instruction in the program, every thread executes the same instruction, but each can operate on different data. But what happens when the instruction that every thread is supposed to execute is inside a conditional block like an if()? Obviously not every thread should execute that instruction, right? For example, consider a kernel that contains the following fragment of code that intends for odd-indexed threads to update the value of some variable:
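A fragment of that shape, with a hypothetical variable standing in for whatever the kernel updates:

```cuda
int myVar = 0;
if (threadIdx.x % 2)
    myVar = threadIdx.x;   // executed only by odd-indexed threads
```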
In the previous example, when the threads arrive at the assignment inside the if(), only the threads with odd indices will execute it, since the threads with even indices do not satisfy the condition if( threadIdx.x % 2 ). The even-numbered threads simply do nothing while the odd threads execute this instruction. When some of the threads need to execute an instruction while others don't, this situation is known as thread divergence. Under normal circumstances, divergent branches simply result in some threads remaining idle while the other threads actually execute the instructions in the branch.
But in the case of __syncthreads(), the result is somewhat tragic. The CUDA Architecture guarantees that no thread will advance to an instruction beyond the __syncthreads() until every thread in the block has executed the __syncthreads(). Unfortunately, if the __syncthreads() sits in a divergent branch, some of the threads will never reach the __syncthreads(). Therefore, because of the guarantee that no instruction after a __syncthreads() can be executed before every thread has executed it, the hardware simply continues to wait for these threads. And waits. And waits. Forever.
This is the situation in the dot product example when we move the __syncthreads() call inside the if() block. Any thread with cacheIndex greater than or equal to i will never execute the __syncthreads(). This effectively hangs the processor because it results in the GPU waiting for something that will never happen.
The moral of this story is that __syncthreads() is a powerful mechanism for ensuring that your massively parallel application still computes the correct results. But because of this potential for unintended consequences, we still need to take care when using it.
We have looked at examples that use shared memory and employed __syncthreads() to ensure that data is ready before we continue. In the name of speed, you may be tempted to live dangerously and omit the __syncthreads(). We will now look at a graphical example that requires __syncthreads() for correctness. We will show you screenshots of the intended output and of the output when run without __syncthreads(). It won’t be pretty.
The body of main() is identical to the GPU Julia Set example, although this time we launch multiple threads per block:
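The launch might be sketched like this, with `dev_bitmap` being the output buffer allocated in main():

```cuda
dim3 grids(DIM/16, DIM/16);
dim3 threads(16, 16);
kernel<<<grids, threads>>>( dev_bitmap );
```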
As with the Julia Set example, each thread will be computing a pixel value for a single output location. The first thing that each thread does is compute its x and y location in the output image. This computation is identical to the tid computation in the vector addition example, although we compute it in two dimensions this time:
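In two dimensions, that computation is:

```cuda
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
```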
Since we will be using a shared memory buffer to cache our computations, we declare one such that each thread in our 16 x 16 block has an entry.
__shared__ float shared[16][16];
Then, each thread computes a value to be stored into this buffer.
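One plausible version of that computation; the constants here are illustrative, chosen to produce the grid of blobs the text describes, and `PI` is assumed to be defined elsewhere:

```cuda
// PI assumed #defined as 3.1415926535897932f
const float period = 128.0f;
shared[threadIdx.x][threadIdx.y] =
        255 * (sinf(x*2.0f*PI/period) + 1.0f) *
              (sinf(y*2.0f*PI/period) + 1.0f) / 4.0f;
```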
And lastly, we store these values back out to the pixel, reversing the order of x and y:
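A sketch of the store; note that each thread reads an entry written by a *different* thread, which is exactly what makes synchronization necessary here:

```cuda
ptr[offset*4 + 0] = 0;
ptr[offset*4 + 1] = shared[15 - threadIdx.x][15 - threadIdx.y];
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
```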
Granted, these computations are somewhat arbitrary. We’ve simply come up with something that will draw a grid of green spherical blobs. So after compiling and running this kernel, we output an image like the one in Figure 5.5.
What happened here? As you may have guessed from the way we set up this example, we’re missing an important synchronization point. When a thread stores the computed value in shared[][] to the pixel, it is possible that the thread responsible for writing that value to shared[][] has not finished writing it yet. The only way to guarantee that this does not happen is by using __syncthreads(). Thus, the result is a corrupted picture of green blobs.
Figure 5.5 A screenshot rendered without proper synchronization
Although this may not be the end of the world, your application might be computing more important values.
Instead, we need to add a synchronization point between the write to shared memory and the subsequent read from it.
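A minimal sketch of the corrected kernel structure, assuming the 16 x 16 block from earlier (the value each thread computes is left as a placeholder, since the book's exact formula is arbitrary anyway):

```cuda
__global__ void kernel( unsigned char *ptr ) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    __shared__ float shared[16][16];

    // Each thread writes its own entry in the shared buffer.
    shared[threadIdx.x][threadIdx.y] = 0.0f;  /* placeholder value */

    // Every thread must finish writing before any thread reads an
    // entry that some other thread was responsible for producing.
    __syncthreads();

    // Reading with x and y reversed means each thread consumes a value
    // written by a different thread, which is why the barrier matters.
    ptr[offset*4 + 1] =
        (unsigned char)(255 * shared[threadIdx.y][threadIdx.x]);
}
```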
With this __syncthreads() in place, we then get a far more predictable (and aesthetically pleasing) result, as shown in Figure 5.6.
Figure 5.6 A screenshot after adding the correct synchronization
We know how blocks can be subdivided into smaller parallel execution units known as threads. We revisited the vector addition example of the previous chapter to see how to perform addition of arbitrarily long vectors. We also showed an example of reduction and how we use shared memory and synchronization to accomplish this. In fact, this example showed how the GPU and CPU can collaborate on computing results. Finally, we showed how perilous it can be to an application when we neglect the need for synchronization.
You have learned most of the basics of CUDA C as well as some of the ways it resembles standard C and a lot of the important ways it differs from standard C. This would be an excellent time to consider some of the problems you have encountered and which ones might lend themselves to parallel implementations with CUDA C. As we progress, we will look at some of the other features we can use to accomplish tasks on the GPU, as well as some of the more advanced API features that CUDA provides to us.
We hope you have learned much about writing code that executes on the GPU. You should know how to spawn parallel blocks to execute your kernels, and you should know how to further split these blocks into parallel threads. You have also seen ways to enable communication and synchronization between these threads. But since the book is not over yet, you may have guessed that CUDA C has even more features that might be useful to you.
This chapter will introduce you to a couple of these more advanced features. Specifically, there exist ways in which you can exploit special regions of memory on your GPU in order to accelerate your applications. In this chapter, we will discuss one of these regions of memory: constant memory. In addition, because we are looking at our first method for enhancing the performance of your CUDA C applications, you will also learn how to measure the performance of your applications using CUDA events. From these measurements, you will be able to quantify the gain (or loss!) from any enhancements you make.
Through the course of this chapter, you will accomplish the following:
• You will learn about using constant memory with CUDA C.
• You will learn about the performance characteristics of constant memory.
• You will learn how to use CUDA events to measure application performance.
Previously, we discussed how modern GPUs are equipped with enormous amounts of arithmetic processing power. In fact, the computational advantage graphics processors have over CPUs helped precipitate the initial interest in using graphics processors for general-purpose computing. With hundreds of arithmetic units on the GPU, often the bottleneck is not the arithmetic throughput of the chip but rather the memory bandwidth of the chip. There are so many ALUs on graphics processors that sometimes we just can’t keep the input coming to them fast enough to sustain such high rates of computation. So, it is worth investigating means by which we can reduce the amount of memory traffic required for a given problem.
We have seen CUDA C programs that have used both global and shared memory so far. However, the language makes available another kind of memory known as constant memory. As the name may indicate, we use constant memory for data that will not change over the course of a kernel execution. NVIDIA hardware provides 64KB of constant memory that it treats differently than it treats standard global memory. In some situations, using constant memory rather than global memory will reduce the required memory bandwidth.
We will look at one way of exploiting constant memory in the context of a simple ray tracing application. First, we will give you some background in the major concepts behind ray tracing. If you are already comfortable with the concepts behind ray tracing, you can skip to the “Ray Tracing on the GPU” section.
Simply put, ray tracing is one way of producing a two-dimensional image of a scene consisting of three-dimensional objects. But isn’t this what GPUs were originally designed for? How is this different from what OpenGL or DirectX do when you play your favorite game? Well, GPUs do indeed solve this same problem, but they use a technique known as rasterization. There are many excellent books on rasterization, so we will not endeavor to explain the differences here. It suffices to say that they are completely different methods that solve the same problem.
So, how does ray tracing produce an image of a three-dimensional scene? The idea is simple: We choose a spot in our scene to place an imaginary camera. This simplified digital camera contains a light sensor, so to produce an image, we need to determine what light would hit that sensor. Each pixel of the resulting image should be the same color and intensity as the ray of light that hits that spot on the sensor.
Since light incident at any point on the sensor can come from any place in our scene, it turns out it’s easier to work backward. That is, rather than trying to figure out what light ray hits the pixel in question, what if we imagine shooting a ray from the pixel and into the scene? In this way, each pixel behaves something like an eye that is “looking” into the scene. Figure 6.1 illustrates these rays being cast out of each pixel and into the scene.
Figure 6.1 A simple ray tracing scheme
We figure out what color is seen by each pixel by tracing a ray from the pixel in question through the scene until it hits one of our objects. We then say that the pixel would “see” this object and can assign its color based on the color of the object it sees. Most of the computation required by ray tracing is in the computation of these intersections of the ray with the objects in the scene.
Moreover, in more complex ray tracing models, shiny objects in the scene can reflect rays, and translucent objects can refract the rays of light. This creates secondary rays, tertiary rays, and so on. In fact, this is one of the attractive features of ray tracing; it is very simple to get a basic ray tracer working, but we can build models of more complex phenomena into the ray tracer in order to produce more realistic images.
Since APIs such as OpenGL and DirectX are not designed to allow ray-traced rendering, we will have to use CUDA C to implement our basic ray tracer. Our ray tracer will be extraordinarily simple so that we can concentrate on the use of constant memory, so if you were expecting code that could form the basis of a full-blown production renderer, you will be disappointed. Our basic ray tracer will only support scenes of spheres, and the camera is restricted to the z-axis, facing the origin. Moreover, we will not support any lighting of the scene to avoid the complications of secondary rays. Instead of computing lighting effects, we will simply assign each sphere a color and then shade them with some precomputed function if they are visible.
So, what will the ray tracer do? It will fire a ray from each pixel and keep track of which rays hit which spheres. It will also track the depth of each of these hits. In the case where a ray passes through multiple spheres, only the sphere closest to the camera can be seen. In essence, our “ray tracer” is not doing much more than hiding surfaces that cannot be seen by the camera.
We will model our spheres with a data structure that stores the sphere’s center coordinate (x, y, z), its radius, and its color (r, b, g).
You will also notice that the structure has a method called hit( float ox, float oy, float *n ). Given a ray shot from the pixel at (ox, oy), this method computes whether the ray intersects the sphere. If the ray does intersect the sphere, the method computes the distance from the camera where the ray hits the sphere. We need this information for the reason mentioned before: In the event that the ray hits more than one sphere, only the closest sphere can actually be seen.
Our main() routine follows roughly the same sequence as our previous image-generating examples.
We allocate memory for our input data, which is an array of spheres that compose our scene. Since we need this data on the GPU but are generating it with the CPU, we have to do both a cudaMalloc() and a malloc() to allocate memory on both the GPU and the CPU. We also allocate a bitmap image that we will fill with output pixel data as we ray trace our spheres on the GPU.
After allocating memory for input and output, we randomly generate the center coordinate, color, and radius for our spheres:
The program currently generates a random array of 20 spheres, but this quantity is specified in a #define and can be adjusted accordingly.
We copy this array of spheres to the GPU using cudaMemcpy() and then free the temporary buffer.
Now that our input is on the GPU and we have allocated space for the output, we are ready to launch our kernel.
We will examine the kernel itself in a moment, but for now you should take it on faith that it ray traces the scene and generates pixel data for the input scene of spheres. Finally, we copy the output image back from the GPU and display it. It should go without saying that we free all allocated memory that hasn’t already been freed.
All of this should be commonplace to you now. So, how do we do the actual ray tracing? Because we have settled on a very simple ray tracing model, our kernel will be very easy to understand. Each thread is generating one pixel for our output image, so we start in the usual manner by computing the x- and y-coordinates for the thread as well as the linearized offset into our output buffer. We will also shift our (x,y) image coordinates by DIM/2 so that the z-axis runs through the center of the image.
Since each ray needs to check each sphere for intersection, we will now iterate through the array of spheres, checking each for a hit.
Clearly, the majority of the interesting computation lies in the for() loop. We iterate through each of the input spheres and call its hit() method to determine whether the ray from our pixel “sees” the sphere. If the ray hits the current sphere, we determine whether the hit is closer to the camera than the last sphere we hit. If it is closer, we store this depth as our new closest sphere. In addition, we store the color associated with this sphere so that when the loop has terminated, the thread knows the color of the sphere that is closest to the camera. Since this is the color that the ray from our pixel “sees,” we conclude that this is the color of the pixel and store this value in our output image buffer.
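The bookkeeping inside that loop can be sketched in plain C. Here each sphere's hit test has already been evaluated into a depth array (with a MISS sentinel for rays that miss a sphere), and a larger depth value is treated as closer to the camera, matching the book's convention; the helper's name and this framing are illustrative.

```c
#include <assert.h>

#define MISS (-2e10f)  /* assumed sentinel for "no intersection" */

/* Given each sphere's hit depth at one pixel (or MISS), return the
 * index of the sphere the pixel "sees", or -1 for the background.
 * A larger depth value is treated as closer to the camera. */
int closest_sphere(const float depth[], int nspheres) {
    float maxz = MISS;
    int best = -1;
    for (int i = 0; i < nspheres; i++) {
        if (depth[i] > maxz) {
            maxz = depth[i];  /* new closest hit so far */
            best = i;
        }
    }
    return best;
}
```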
After every sphere has been checked for intersection, we can store the current color into the output image.
Note that if no spheres have been hit, the color that we store will be whatever color we initialized the variables r, b, and g to. In this case, we set r, b, and g to zero so the background will be black. You can change these values to render a different color background. Figure 6.2 shows an example of what the output should look like when rendered with 20 spheres and a black background.
Figure 6.2 A screenshot from the ray tracing example
Since we randomly generated the sphere positions, colors, and sizes, we advise you not to panic if your output doesn’t match this image identically.
You may have noticed that we never mentioned constant memory in the ray tracing example. Now it’s time to improve this example using the benefits of constant memory. Since we cannot modify constant memory, we clearly can’t use it for the output image data. And this example has only one input, the array of spheres, so it should be pretty obvious what data we will store in constant memory.
The mechanism for declaring memory constant is identical to the one we used for declaring a buffer as shared memory. Instead of declaring our array like this:
Sphere *s;
we add the modifier __constant__ before it:
Notice that in the original example, we declared a pointer and then used cudaMalloc() to allocate GPU memory for it. When we changed it to constant memory, we also changed the declaration to statically allocate the space in constant memory. We no longer need to worry about calling cudaMalloc() or cudaFree() for our array of spheres, but we do need to commit to a size for this array at compile-time. For many applications, this is an acceptable trade-off for the performance benefits of constant memory. We will talk about these benefits momentarily, but first we will look at how the use of constant memory changes our main() routine:
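The change amounts to replacing the pointer declaration with a statically sized array in constant memory and swapping the copy call; SPHERES is the #define mentioned earlier, and temp_s stands in for the host-side staging array:

```cuda
#define SPHERES 20

// Statically allocated in constant memory: no cudaMalloc()/cudaFree(),
// but the array size must be fixed at compile time.
__constant__ Sphere s[SPHERES];

// ...later, inside main(), after filling the host array temp_s:
cudaMemcpyToSymbol( s, temp_s, sizeof(Sphere) * SPHERES );
```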
Largely, this is identical to the previous implementation of main(). As we mentioned previously, we no longer need the call to cudaMalloc() to allocate space for our array of spheres. The other change has been highlighted in the listing:
We use this special version of cudaMemcpy() when we copy from host memory to constant memory on the GPU. The only difference in functionality between cudaMemcpyToSymbol() and cudaMemcpy() is that cudaMemcpyToSymbol() can copy to constant memory while cudaMemcpy() can only copy to pointers in global memory.
Outside the __constant__ modifier and the two changes to main(), the versions with and without constant memory are identical.
Declaring memory as __constant__ constrains our usage to be read-only. In taking on this constraint, we expect to get something in return. As we previously mentioned, reading from constant memory can conserve memory bandwidth when compared to reading the same data from global memory. There are two reasons why reading from the 64KB of constant memory can save bandwidth over standard reads of global memory:
• A single read from constant memory can be broadcast to other “nearby” threads, effectively saving up to 15 reads.
• Constant memory is cached, so consecutive reads of the same address will not incur any additional memory traffic.
What do we mean by the word nearby? To answer this question, we will need to explain the concept of a warp. For those readers who are more familiar with Star Trek than with weaving, a warp in this context has nothing to do with the speed of travel through space. In the world of weaving, a warp refers to the group of threads being woven together into fabric. In the CUDA Architecture, a warp refers to a collection of 32 threads that are “woven together” and get executed in lockstep. At every line in your program, each thread in a warp executes the same instruction on different data.
When it comes to handling constant memory, NVIDIA hardware can broadcast a single memory read to each half-warp. A half-warp—not nearly as creatively named as a warp—is a group of 16 threads: half of a 32-thread warp. If every thread in a half-warp requests data from the same address in constant memory, your GPU will generate only a single read request and subsequently broadcast the data to every thread. If you are reading a lot of data from constant memory, you will generate only 1/16 (roughly 6 percent) of the memory traffic as you would when using global memory.
But the savings don’t stop at a 94 percent reduction in bandwidth when reading constant memory! Because we have committed to leaving the memory unchanged, the hardware can aggressively cache the constant data on the GPU. So after the first read from an address in constant memory, other half-warps requesting the same address, and therefore hitting the constant cache, will generate no additional memory traffic.
In the case of our ray tracer, every thread in the launch reads the data corresponding to the first sphere so the thread can test its ray for intersection. After we modify our application to store the spheres in constant memory, the hardware needs to make only a single request for this data. After caching the data, every other thread avoids generating memory traffic as a result of one of the two constant memory benefits:
• It receives the data in a half-warp broadcast.
• It retrieves the data from the constant memory cache.
Unfortunately, there can potentially be a downside to performance when using constant memory. The half-warp broadcast feature is in actuality a double-edged sword. Although it can dramatically accelerate performance when all 16 threads are reading the same address, it actually slows performance to a crawl when all 16 threads read different addresses.
The trade-off to allowing the broadcast of a single read to 16 threads is that the 16 threads are allowed to place only a single read request at a time. For example, if all 16 threads in a half-warp need different data from constant memory, the 16 different reads get serialized, effectively taking 16 times the amount of time to place the request. If they were reading from conventional global memory, the request could be issued at the same time. In this case, reading from constant memory would probably be slower than using global memory.
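A toy counting model makes the trade-off concrete: treat each distinct address requested within a half-warp as one serialized read, so the cost ranges from 1 (all 16 threads agree) to 16 (all different). This is a deliberate simplification for intuition, not a statement of the hardware's actual behavior.

```c
#include <assert.h>

#define HALF_WARP 16

/* Under the simplified model, the number of serialized constant-memory
 * read requests issued for a half-warp equals the number of distinct
 * addresses the 16 threads asked for. */
int constant_read_requests(const unsigned addr[HALF_WARP]) {
    int requests = 0;
    for (int i = 0; i < HALF_WARP; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (addr[j] == addr[i]) { seen = 1; break; }
        }
        if (!seen)
            requests++;  /* first time this address appears */
    }
    return requests;
}
```

All 16 threads reading one address cost a single request; 16 different addresses cost 16, matching the broadcast-versus-serialization behavior described above.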
Fully aware that there may be either positive or negative implications, you have changed your ray tracer to use constant memory. How do you determine how this has impacted the performance of your program? One of the simplest metrics involves answering this simple question: Which version takes less time to finish? We could use one of the CPU or operating system timers, but this will include latency and variation from any number of sources (operating system thread scheduling, availability of high-precision CPU timers, and so on). Furthermore, while the GPU kernel runs, we may be asynchronously performing computation on the host. The only way to time these host computations is using the CPU or operating system timing mechanism. So to measure the time a GPU spends on a task, we will use the CUDA event API.
An event in CUDA is essentially a GPU time stamp that is recorded at a user-specified point in time. Since the GPU itself is recording the time stamp, it eliminates a lot of the problems we might encounter when trying to time GPU execution with CPU timers. The API is relatively easy to use, since taking a time stamp consists of just two steps: creating an event and subsequently recording an event. For example, at the beginning of some sequence of code, we instruct the CUDA runtime to make a record of the current time. We do so by creating and then recording the event:
You will notice that when we instruct the runtime to record the event start, we also pass it a second argument. In the previous example, this argument is 0. The exact nature of this argument is unimportant for our purposes right now, so we intend to leave it mysteriously unexplained rather than open a new can of worms. If your curiosity is killing you, we intend to discuss this when we talk about streams.
To time a block of code, we will want to create both a start event and a stop event. We will have the CUDA runtime record when we start, tell it to do some other work on the GPU, and then tell it to record when we’ve stopped:
Unfortunately, there is still a problem with timing GPU code in this way. The fix will require only one line of code but will require some explanation. The trickiest part of using events arises as a consequence of the fact that some of the calls we make in CUDA C are actually asynchronous. For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes. This is excellent from a performance standpoint because it means we can be computing something on the GPU and CPU at the same time, but conceptually it makes timing tricky.
You should imagine calls to cudaEventRecord() as an instruction to record the current time being placed into the GPU’s pending queue of work. As a result, our event won’t actually be recorded until the GPU finishes everything prior to the call to cudaEventRecord(). In terms of having our stop event measure the correct time, this is precisely what we want. But we cannot safely read the value of the stop event until the GPU has completed its prior work and recorded the stop event. Fortunately, we have a way to instruct the CPU to synchronize on an event, the event API function cudaEventSynchronize():
Now, we have instructed the runtime to block further instruction until the GPU has reached the stop event. When the call to cudaEventSynchronize() returns, we know that all GPU work before the stop event has completed, so it is safe to read the time stamp recorded in stop. It is worth noting that because CUDA events get implemented directly on the GPU, they are unsuitable for timing mixtures of device and host code. That is, you will get unreliable results if you attempt to use CUDA events to time more than kernel executions and memory copies involving the device.
To time our ray tracer, we will need to create a start and stop event, just as we did when learning about events. The following is a timing-enabled version of the ray tracer that does not use constant memory:
Notice that we have thrown two additional functions into the mix, the calls to cudaEventElapsedTime() and cudaEventDestroy(). The function cudaEventElapsedTime() is a utility that computes the elapsed time between two previously recorded events. The time in milliseconds elapsed between the two events is returned in the first argument, the address of a floating-point variable.
The call to cudaEventDestroy() needs to be made when we’re finished using an event created with cudaEventCreate(). This is identical to calling free() on memory previously allocated with malloc(), so we needn’t stress how important it is to match every cudaEventCreate() with a cudaEventDestroy().
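Putting the whole API together, the timing skeleton around the GPU work looks roughly like this (error checking is omitted, and the kernel launch is a placeholder):

```cuda
cudaEvent_t start, stop;
float elapsedTime;

cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );

// ...kernel launches and memory copies to be timed go here...

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );  // block the CPU until 'stop' is recorded

cudaEventElapsedTime( &elapsedTime, start, stop );
printf( "Time to generate:  %3.1f ms\n", elapsedTime );

cudaEventDestroy( start );
cudaEventDestroy( stop );
```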
We can instrument the ray tracer that does use constant memory in the same fashion:
Now when we run our two versions of the ray tracer, we can compare the time it takes to complete the GPU work. This will tell us at a high level whether introducing constant memory has improved the performance of our application or worsened it. Fortunately, in this case, performance is improved dramatically by using constant memory. Our experiments on a GeForce GTX 280 show the constant memory ray tracer performing up to 50 percent faster than the version that uses global memory. On a different GPU, your mileage might vary, although the ray tracer that uses constant memory should always be at least as fast as the version without it.
In addition to the global and shared memory we explored in previous chapters, NVIDIA hardware makes other types of memory available for our use. Constant memory comes with additional constraints over standard global memory, but in some cases, subjecting ourselves to these constraints can yield additional performance. Specifically, we can see additional performance when threads in a warp need access to the same read-only data. Using constant memory for data with this access pattern can conserve bandwidth both because of the capacity to broadcast reads across a half-warp and because of the presence of a constant memory cache on chip. Memory bandwidth bottlenecks a wide class of algorithms, so having mechanisms to ameliorate this situation can prove incredibly useful.
We also learned how to use CUDA events to request the runtime to record time stamps at specific points during GPU execution. We saw how to synchronize the CPU with the GPU on one of these events and then how to compute the time elapsed between two events. In doing so, we built up a method to compare the running time between two different methods for ray tracing spheres, concluding that, for the application at hand, using constant memory gained us a significant amount of performance.
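The event-timing pattern summarized above follows a small, fixed sequence of runtime calls. As a minimal sketch (the kernel launch is a placeholder for whatever GPU work is being measured):

```cuda
#include <stdio.h>

// Sketch of CUDA event-based timing; "GPU work goes here" stands in
// for the kernel launches and memory copies being measured.
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
// ... GPU work goes here ...
cudaEventRecord( stop, 0 );

// Block the CPU until the stop event has actually been recorded
cudaEventSynchronize( stop );

float elapsedTime;
cudaEventElapsedTime( &elapsedTime, start, stop );
printf( "Time to generate:  %3.1f ms\n", elapsedTime );

cudaEventDestroy( start );
cudaEventDestroy( stop );
```

Note that cudaEventSynchronize() is required before reading the elapsed time, since event recording happens asynchronously with respect to the CPU.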
When we looked at constant memory, we saw how exploiting special memory spaces under the right circumstances can dramatically accelerate applications. We also learned how to measure these performance gains in order to make informed decisions about performance choices. In this chapter, we will learn about how to allocate and use texture memory. Like constant memory, texture memory is another variety of read-only memory that can improve performance and reduce memory traffic when reads have certain access patterns. Although texture memory was originally designed for traditional graphics applications, it can also be used quite effectively in some GPU computing applications.
Through the course of this chapter, you will accomplish the following:
• You will learn about the performance characteristics of texture memory.
• You will learn how to use one-dimensional texture memory with CUDA C.
• You will learn how to use two-dimensional texture memory with CUDA C.
If you read the introduction to this chapter, the secret is already out: There is yet another type of read-only memory that is available for use in your programs written in CUDA C. Readers familiar with the workings of graphics hardware will not be surprised, but the GPU’s sophisticated texture memory may also be used for general-purpose computing. Although NVIDIA designed the texture units for the classical OpenGL and DirectX rendering pipelines, texture memory has some properties that make it extremely useful for computing.
Like constant memory, texture memory is cached on chip, so in some situations it will provide higher effective bandwidth by reducing memory requests to off-chip DRAM. Specifically, texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality. In a computing application, this roughly implies that a thread is likely to read from an address “near” the address that nearby threads read, as shown in Figure 7.1.
Figure 7.1 A mapping of threads into a two-dimensional region of memory
Arithmetically, the four addresses shown are not consecutive, so they would not be cached together in a typical CPU caching scheme. But since GPU texture caches are designed to accelerate access patterns such as this one, you will see an increase in performance in this case when using texture memory instead of global memory. In fact, this sort of access pattern is not incredibly uncommon in general-purpose computing, as we shall see.
Physical simulations can be among the most computationally challenging problems to solve. Fundamentally, there is often a trade-off between accuracy and computational complexity. As a result, computer simulations have become more and more important in recent years, thanks in large part to the increased accuracy possible as a consequence of the parallel computing revolution. Since many physical simulations can be parallelized quite easily, we will look at a very simple simulation model in this example.
To demonstrate a situation where you can effectively employ texture memory, we will construct a simple two-dimensional heat transfer simulation. We start by assuming that we have some rectangular room that we divide into a grid. Inside the grid, we will randomly scatter a handful of “heaters” with various fixed temperatures. Figure 7.2 shows an example of what this room might look like.
Figure 7.2 A room with “heaters” of various temperature
Given a rectangular grid and configuration of heaters, we are looking to simulate what happens to the temperature in every grid cell as time progresses. For simplicity, cells with heaters in them always remain at a constant temperature. At every step in time, we will assume that heat “flows” between a cell and its neighbors. If a cell’s neighbor is warmer than it is, the warmer neighbor will tend to warm it up. Conversely, if a cell has a neighbor cooler than it is, it will cool off. Qualitatively, Figure 7.3 represents this flow of heat.
Figure 7.3 Heat dissipating from warm cells into cold cells
In our heat transfer model, we will compute the new temperature in a grid cell as a sum of the differences between its temperature and the temperatures of its neighbors, or, essentially, an update equation as shown in Equation 7.1.
In the equation for updating a cell’s temperature, the constant k simply represents the rate at which heat flows through the simulation. A large value of k will drive the system to a constant temperature quickly, while a small value will allow the solution to retain large temperature gradients longer. Since we consider only four neighbors (top, bottom, left, right) and k and TOLD remain constant in the equation, this update becomes like the one shown in Equation 7.2.
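Written out, the general update (Equation 7.1) and its four-neighbor simplification (Equation 7.2) take roughly the following form. These are reconstructed from the surrounding description, since the typeset equations did not survive extraction:

```latex
% Equation 7.1: sum of scaled differences with all neighbors
T_{\mathrm{NEW}} \;=\; T_{\mathrm{OLD}} \;+\; \sum_{n \in \mathrm{neighbors}} k \left( T_{n} - T_{\mathrm{OLD}} \right)

% Equation 7.2: with only the four neighbors (top, bottom, left, right)
% and constant k, the sum collapses to
T_{\mathrm{NEW}} \;=\; T_{\mathrm{OLD}} \;+\; k \left( T_{\mathrm{TOP}} + T_{\mathrm{BOTTOM}} + T_{\mathrm{LEFT}} + T_{\mathrm{RIGHT}} - 4\,T_{\mathrm{OLD}} \right)
```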
Like with the ray tracing example in the previous chapter, this model is not intended to be close to what might be used in industry (in fact, it is not really even an approximation of something physically accurate). We have simplified this model immensely in order to draw attention to the techniques at hand. With this in mind, let’s take a look at how the update given by Equation 7.2 can be computed on the GPU.
We will cover the specifics of each step in a moment, but at a high level, our update process proceeds as follows:
1. Given some grid of input temperatures, copy the temperature of cells with heaters to this grid. This will overwrite any previously computed temperatures in these cells, thereby enforcing our restriction that “heating cells” remain at a constant temperature. This copy gets performed in copy_const_kernel().
2. Given the input temperature grid, compute the output temperatures based on the update in Equation 7.2. This update gets performed in blend_kernel().
3. Swap the input and output buffers in preparation of the next time step. The output temperature grid computed in step 2 will become the input temperature grid that we start with in step 1 when simulating the next time step.
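One frame's worth of these three steps can be sketched as a simple loop. The DataBlock fields dev_inSrc, dev_outSrc, and dev_constSrc are assumed names here for the input grid, output grid, and heater grid, respectively:

```cuda
// One frame of simulation: advance several time steps, following the
// three-step algorithm above. Buffer names are assumed for illustration.
for (int i = 0; i < 90; i++) {
    // Step 1: re-stamp the fixed heater temperatures onto the input grid
    copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc, d->dev_constSrc );

    // Step 2: compute the new temperatures into the output grid
    blend_kernel<<<blocks,threads>>>( d->dev_outSrc, d->dev_inSrc );

    // Step 3: swap the buffers so this step's output becomes
    // the next step's input
    float *tmp = d->dev_inSrc;
    d->dev_inSrc = d->dev_outSrc;
    d->dev_outSrc = tmp;
}
```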
Before beginning the simulation, we assume we have generated a grid of constants. Most of the entries in this grid are zero, but some entries contain nonzero temperatures that represent heaters at fixed temperatures. This buffer of constants will not change over the course of the simulation and gets read at each time step.
Because of the way we are modeling our heat transfer, we start with the output grid from the previous time step. Then, according to step 1, we copy the temperatures of the cells with heaters into this output grid, overwriting any previously computed temperatures. We do this because we have assumed that the temperature of these heater cells remains constant. We perform this copy of the constant grid onto the input grid with the following kernel:
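The copy kernel itself is short. This sketch follows the structure described in the next paragraph, assuming a DIM x DIM grid laid out linearly:

```cuda
// Copy nonzero (heater) cells from the constant grid cptr[] onto the
// input grid iptr[], leaving all other cells untouched.
__global__ void copy_const_kernel( float *iptr, const float *cptr ) {
    // map from threadIdx/blockIdx to a cell position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    if (cptr[offset] != 0)
        iptr[offset] = cptr[offset];
}
```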
The first three lines should look familiar. The first two lines convert a thread’s threadIdx and blockIdx into an x- and a y-coordinate. The third line computes a linear offset into our constant and input buffers. The highlighted line performs the copy of the heater temperature in cptr[] to the input grid in iptr[]. Notice that the copy is performed only if the cell in the constant grid is nonzero. We do this to preserve any values that were computed in the previous time step within cells that do not contain heaters. Cells with heaters will have nonzero entries in cptr[] and will therefore have their temperatures preserved from step to step thanks to this copy kernel.
Step 2 of the algorithm is the most computationally involved. To perform the updates, we can have each thread take responsibility for a single cell in our simulation. Each thread will read its cell’s temperature and the temperatures of its neighboring cells, perform the previous update computation, and then update its temperature with the new value. Much of this kernel resembles techniques you’ve used before.
Notice that we start exactly as we did for the examples that produced images as their output. However, instead of computing the color of a pixel, the threads are computing temperatures of simulation grid cells. Nevertheless, they start by converting their threadIdx and blockIdx into an x, y, and offset. You might be able to recite these lines in your sleep by now (although for your sake, we hope you aren’t actually reciting them in your sleep).
Next, we determine the offsets of our left, right, top, and bottom neighbors so that we can read the temperatures of those cells. We will need those values to compute the updated temperature in the current cell. The only complication here is that we need to adjust indices on the border so that cells around the edges do not wrap around. Finally, in the highlighted line, we perform the update from Equation 7.2, adding the old temperature and the scaled differences of that temperature and the cell’s neighbors’ temperatures.
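Putting these pieces together, the update kernel looks roughly like the following sketch. DIM is the grid dimension, and SPEED is an assumed constant standing in for the flow rate k from Equation 7.2:

```cuda
// Update each cell from its four neighbors per Equation 7.2.
// DIM (grid width) and SPEED (the constant k) are assumed to be
// defined elsewhere in the program.
__global__ void blend_kernel( float *outSrc, const float *inSrc ) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    // clamp neighbor indices at the borders so edge cells do not wrap
    int left  = offset - 1;
    int right = offset + 1;
    if (x == 0)       left++;
    if (x == DIM - 1) right--;

    int top    = offset - DIM;
    int bottom = offset + DIM;
    if (y == 0)       top += DIM;
    if (y == DIM - 1) bottom -= DIM;

    // the update from Equation 7.2: old value plus scaled differences
    outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
                     inSrc[bottom] + inSrc[left] + inSrc[right] -
                     inSrc[offset] * 4 );
}
```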
The remainder of the code primarily sets up the grid and then displays an animated output of the heat map. We will walk through that code now:
We have equipped the code with event-based timing as we did in the previous chapter’s ray tracing example. The timing code serves the same purpose as it did previously. Since we will endeavor to accelerate the initial implementation, we have put in place a mechanism by which we can measure performance and convince ourselves that we have succeeded.
The function anim_gpu() gets called by the animation framework on every frame. The arguments to this function are a pointer to a DataBlock and the number of ticks of the animation that have elapsed. As with the animation examples, we use blocks of 256 threads that we organize into a two-dimensional grid of 16 x 16. Each iteration of the for() loop in anim_gpu() computes a single time step of the simulation as described by the three-step algorithm at the beginning of Section 7.3.2: Computing Temperature Updates. Since the DataBlock contains the constant buffer of heaters as well as the output of the last time step, it encapsulates the entire state of the animation, and consequently, anim_gpu() does not actually need to use the value of ticks anywhere.
You will notice that we have chosen to do 90 time steps per frame. This number is not magical but was determined somewhat experimentally as a reasonable trade-off between having to download a bitmap image for every time step and computing too many time steps per frame, resulting in a jerky animation. If you were more concerned with getting the output of each simulation step than you were with animating the results in real time, you could change this such that you computed only a single step on each frame.
After computing the 90 time steps since the previous frame, anim_gpu() is ready to copy a bitmap frame of the current animation back to the CPU. Since the for() loop leaves the input and output swapped, we pass the input buffer to the next kernel, which actually contains the output of the 90th time step. We convert the temperatures to colors using the kernel float_to_color() and then copy the resultant image back to the CPU with a cudaMemcpy() that specifies the direction of copy as cudaMemcpyDeviceToHost. Finally, to prepare for the next sequence of time steps, we swap the output buffer back to the input buffer since it will serve as input to the next time steps.
Figure 7.4 shows an example of what the output might look like. You will notice in the image some of the “heaters” that appear to be pixel-sized islands that disrupt the continuity of the temperature distribution.
Figure 7.4 A screenshot from the animated heat transfer simulation
There is a considerable amount of spatial locality in the memory access pattern required to perform the temperature update in each step. As we explained previously, this is exactly the type of access pattern that GPU texture memory is designed to accelerate. Given that we want to use texture memory, we need to learn the mechanics of doing so.
First, we will need to declare our inputs as texture references. We will use references to floating-point textures, since our temperature data is floating-point.
The next major difference is that after allocating GPU memory for these three buffers, we need to bind the references to the memory buffer using cudaBindTexture(). This basically tells the CUDA runtime two things:
• We intend to use the specified buffer as a texture.
• We intend to use the specified texture reference as the texture’s “name.”
After the three allocations in our heat transfer simulation, we bind the three allocations to the texture references declared earlier (texConstSrc, texIn, and texOut).
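Concretely, the declaration and binding steps look like this. The texture references live at file scope, and imageSize is assumed to be DIM * DIM * sizeof(float):

```cuda
// File-scope texture references; <float> declares one-dimensional
// textures of floating-point data.
texture<float>  texConstSrc;
texture<float>  texIn;
texture<float>  texOut;

// After cudaMalloc()ing the three device buffers, bind each one to
// its texture reference. imageSize is the buffer size in bytes.
cudaBindTexture( NULL, texConstSrc, data.dev_constSrc, imageSize );
cudaBindTexture( NULL, texIn,       data.dev_inSrc,    imageSize );
cudaBindTexture( NULL, texOut,      data.dev_outSrc,   imageSize );
```

The first argument, an optional byte offset, is NULL here because the buffers were allocated by cudaMalloc() and are therefore suitably aligned.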
At this point, our textures are completely set up, and we’re ready to launch our kernel. However, when we’re reading from textures in the kernel, we need to use special functions to instruct the GPU to route our requests through the texture unit and not through standard global memory. As a result, we can no longer simply use square brackets to read from buffers; we need to modify blend_kernel() to use tex1Dfetch() when reading from memory.
Additionally, there is another difference between using global and texture memory that requires us to make another change. Although it looks like a function, tex1Dfetch() is a compiler intrinsic. And since texture references must be declared globally at file scope, we can no longer pass the input and output buffers as parameters to blend_kernel() because the compiler needs to know at compile time which textures tex1Dfetch() should be sampling. Rather than passing pointers to input and output buffers as we previously did, we will pass to blend_kernel() a boolean flag dstOut that indicates which buffer to use as input and which to use as output. The changes to blend_kernel() are highlighted here:
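A sketch of the texture-based kernel follows. The structure mirrors the global-memory version, with tex1Dfetch() reads selected by the dstOut flag (DIM and SPEED are assumed constants, as before):

```cuda
// Texture version of the update kernel. The textures texIn and texOut
// are bound at file scope, so dstOut selects which one is the input.
__global__ void blend_kernel( float *dst, bool dstOut ) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    int left  = offset - 1;
    int right = offset + 1;
    if (x == 0)       left++;
    if (x == DIM - 1) right--;

    int top    = offset - DIM;
    int bottom = offset + DIM;
    if (y == 0)       top += DIM;
    if (y == DIM - 1) bottom -= DIM;

    float t, l, c, r, b;
    if (dstOut) {            // texIn holds the input this step
        t = tex1Dfetch( texIn, top );
        l = tex1Dfetch( texIn, left );
        c = tex1Dfetch( texIn, offset );
        r = tex1Dfetch( texIn, right );
        b = tex1Dfetch( texIn, bottom );
    } else {                 // texOut holds the input this step
        t = tex1Dfetch( texOut, top );
        l = tex1Dfetch( texOut, left );
        c = tex1Dfetch( texOut, offset );
        r = tex1Dfetch( texOut, right );
        b = tex1Dfetch( texOut, bottom );
    }
    dst[offset] = c + SPEED * ( t + b + r + l - 4 * c );
}
```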
Since the copy_const_kernel() kernel reads from our buffer that holds the heater positions and temperatures, we will need to make a similar modification there in order to read through texture memory instead of global memory:
Since the signature of blend_kernel() changed to accept a flag that switches the buffers between input and output, we need a corresponding change to the anim_gpu() routine. Rather than swapping buffers, we set dstOut = !dstOut to toggle the flag after each series of calls:
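The revised loop in anim_gpu() then looks roughly like this sketch; since the textures are bound to fixed buffers, only the flag changes from step to step:

```cuda
// Flag-toggling version of the time-step loop: instead of swapping
// pointers, we alternate which buffer plays the role of input.
volatile bool dstOut = true;
for (int i = 0; i < 90; i++) {
    float *in, *out;
    if (dstOut) {
        in  = d->dev_inSrc;
        out = d->dev_outSrc;
    } else {
        out = d->dev_inSrc;
        in  = d->dev_outSrc;
    }
    copy_const_kernel<<<blocks,threads>>>( in );
    blend_kernel<<<blocks,threads>>>( out, dstOut );
    dstOut = !dstOut;
}
```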
The final change to our heat transfer routine involves cleaning up at the end of the application’s run. Rather than just freeing the global buffers, we also need to unbind textures:
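The cleanup routine becomes, in sketch form:

```cuda
// Unbind the textures before freeing the underlying device buffers
cudaUnbindTexture( texIn );
cudaUnbindTexture( texOut );
cudaUnbindTexture( texConstSrc );

cudaFree( d->dev_inSrc );
cudaFree( d->dev_outSrc );
cudaFree( d->dev_constSrc );
```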
Toward the beginning of this book, we mentioned how some problems have two-dimensional domains, and therefore it can be convenient to use two-dimensional blocks and grids at times. The same is true for texture memory. There are many cases when having a two-dimensional memory region can be useful, a claim that should come as no surprise to anyone familiar with multidimensional arrays in standard C. Let’s look at how we can modify our heat transfer application to use two-dimensional textures.
First, our texture reference declarations change. If unspecified, texture references are one-dimensional by default, so we add a dimensionality argument of 2 in order to declare two-dimensional textures.
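The two-dimensional declarations differ from the one-dimensional ones only in the added dimensionality argument:

```cuda
// Two-dimensional texture references: the second template argument
// specifies the dimensionality.
texture<float,2>  texConstSrc;
texture<float,2>  texIn;
texture<float,2>  texOut;
```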
The simplification promised by converting to two-dimensional textures comes in the blend_kernel() method. Although we need to change our tex1Dfetch() calls to tex2D() calls, we no longer need to use the linearized offset variable to compute the set of offsets top, left, right, and bottom. When we switch to a two-dimensional texture, we can use x and y directly to address the texture.
Furthermore, we no longer have to worry about bounds overflow when we switch to using tex2D(). If one of x or y is less than zero, tex2D() will return the value at zero. Likewise, if one of these values is greater than the width, tex2D() will return the value at width - 1; the coordinates are clamped to the valid range. Note that in our application, this behavior is ideal, but it's possible that other applications would desire other behavior.
As a result of these simplifications, our kernel cleans up nicely.
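A sketch of the two-dimensional version, with the neighbor-offset arithmetic gone (SPEED again stands in for the constant k):

```cuda
// Two-dimensional texture version: neighbors are addressed directly
// by (x, y) coordinates, and border clamping is handled by tex2D().
__global__ void blend_kernel( float *dst, bool dstOut ) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    float t, l, c, r, b;
    if (dstOut) {
        t = tex2D( texIn, x, y - 1 );
        l = tex2D( texIn, x - 1, y );
        c = tex2D( texIn, x, y );
        r = tex2D( texIn, x + 1, y );
        b = tex2D( texIn, x, y + 1 );
    } else {
        t = tex2D( texOut, x, y - 1 );
        l = tex2D( texOut, x - 1, y );
        c = tex2D( texOut, x, y );
        r = tex2D( texOut, x + 1, y );
        b = tex2D( texOut, x, y + 1 );
    }
    dst[offset] = c + SPEED * ( t + b + r + l - 4 * c );
}
```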
Since all of our previous calls to tex1Dfetch() need to be changed to tex2D() calls, we make the corresponding change in copy_const_kernel(). Similarly to the kernel blend_kernel(), we no longer need to use offset to address the texture; we simply use x and y to address the constant source:
The final change to the one-dimensional texture version of our heat transfer simulation is along the same lines as our previous changes. Specifically, in main(), we need to change our texture binding calls to instruct the runtime that the buffer we plan to use will be treated as a two-dimensional texture, not a one-dimensional one:
As with the nontexture and one-dimensional texture versions, we begin by allocating storage for our input arrays. We deviate from the one-dimensional example because the CUDA runtime requires that we provide a cudaChannelFormatDesc when we bind two-dimensional textures. The previous listing includes a declaration of a channel format descriptor. In our case, we can accept the default parameters and simply need to specify that we require a floating-point descriptor. We then bind the three input buffers as two-dimensional textures using cudaBindTexture2D(), the dimensions of the texture (DIM x DIM), and the channel format descriptor (desc). The rest of main() remains the same.
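The two-dimensional binding calls can be sketched as follows; the final argument is the pitch of each row in bytes, which for our tightly packed DIM x DIM grid is simply sizeof(float) * DIM:

```cuda
// Describe the texel format: a single float per element
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

// Bind each DIM x DIM buffer as a two-dimensional texture.
// The last argument is the row pitch in bytes.
cudaBindTexture2D( NULL, texConstSrc, data.dev_constSrc,
                   desc, DIM, DIM, sizeof(float) * DIM );
cudaBindTexture2D( NULL, texIn,  data.dev_inSrc,
                   desc, DIM, DIM, sizeof(float) * DIM );
cudaBindTexture2D( NULL, texOut, data.dev_outSrc,
                   desc, DIM, DIM, sizeof(float) * DIM );
```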
Although we needed different functions to instruct the runtime to bind one-dimensional or two-dimensional textures, we use the same routine to unbind the texture, cudaUnbindTexture(). Because of this, our cleanup routine can remain unchanged.
The version of our heat transfer simulation that uses two-dimensional textures has essentially identical performance characteristics as the version that uses one-dimensional textures. So from a performance standpoint, the decision between one- and two-dimensional textures is likely to be inconsequential. For our particular application, the code is a little simpler when using two-dimensional textures because we happen to be simulating a two-dimensional domain. But in general, since this is not always the case, we suggest you make the decision between one- and two-dimensional textures on a case-by-case basis.
As we saw in the previous chapter with constant memory, some of the benefit of texture memory comes as the result of on-chip caching. This is especially noticeable in applications such as our heat transfer simulation: applications that have some spatial coherence to their data access patterns. We saw how either one- or two-dimensional textures can be used, both having similar performance characteristics. As with a block or grid shape, the choice of one- or two-dimensional texture is largely one of convenience. Since the code became somewhat cleaner when we switched to two-dimensional textures and the borders are handled automatically, we would probably advocate the use of a 2D texture in our heat transfer application. But as you saw, it will work fine either way.
Texture memory can provide additional speedups if we utilize some of the conversions that texture samplers can perform automatically, such as unpacking packed data into separate variables or converting 8- and 16-bit integers to normalized floating-point numbers. We didn’t explore either of these capabilities in the heat transfer application, but they might be useful to you!
Since this book has focused on general-purpose computation, for the most part we’ve ignored that GPUs contain some special-purpose components as well. The GPU owes its success to its ability to perform complex rendering tasks in real time, freeing the rest of the system to concentrate on other work. This leads us to the obvious question: Can we use the GPU for both rendering and general-purpose computation in the same application? What if the images we want to render rely on the results of our computations? Or what if we want to take the frame we’ve rendered and perform some image-processing or statistics computations on it?
Fortunately, not only is this interaction between general-purpose computation and rendering modes possible, but it’s fairly easy to accomplish given what you already know. CUDA C applications can seamlessly interoperate with either of the two most popular real-time rendering APIs, OpenGL and DirectX. This chapter will look at the mechanics by which you can enable this functionality.
The examples in this chapter deviate some from the precedents we’ve set in previous chapters. In particular, this chapter assumes a significant amount about your background with other technologies. Specifically, we have included a considerable amount of OpenGL and GLUT code in these examples, almost none of which will we explain in great depth. There are many superb resources to learn graphics APIs, both online and in bookstores, but these topics are well beyond the intended scope of this book. Rather, this chapter intends to focus on CUDA C and the facilities it offers to incorporate it into your graphics applications. If you are unfamiliar with OpenGL or DirectX, you are unlikely to derive much benefit from this chapter and may want to skip to the next.
Through the course of this chapter, you will accomplish the following:
• You will learn what graphics interoperability is and why you might use it.
• You will learn how to set up a CUDA device for graphics interoperability.
• You will learn how to share data between your CUDA C kernels and OpenGL rendering.
To demonstrate the mechanics of interoperation between graphics and CUDA C, we’ll write an application that works in two steps. The first step uses a CUDA C kernel to generate image data. In the second step, the application passes this data to the OpenGL driver to render. To accomplish this, we will use much of the CUDA C we have seen in previous chapters along with some OpenGL and GLUT calls.
To start our application, we include the relevant GLUT and CUDA headers in order to ensure the correct functions and enumerations are defined. We also define the size of the window into which our application plans to render. At 512 x 512 pixels, we will do relatively small drawings.
Additionally, we declare two global variables that will store handles to the data we intend to share between OpenGL and CUDA. We will see momentarily how we use these two variables, but they will store different handles to the same buffer. We need two separate variables because OpenGL and CUDA will each have a different "name" for the buffer. The variable bufferObj will be OpenGL's name for the data, and the variable resource will be the CUDA C name for it.
Now let’s take a look at the actual application. The first thing we do is select a CUDA device on which to run our application. On many systems, this is not a complicated process, since they will often contain only a single CUDA-enabled GPU. However, an increasing number of systems contain more than one CUDA-enabled GPU, so we need a method to choose one. Fortunately, the CUDA runtime provides such a facility to us.
You may recall that we saw cudaChooseDevice() in Chapter 3, but since it was something of an ancillary point, we’ll review it again now. Essentially, this code tells the runtime to select any GPU that has a compute capability of version 1.0 or better. It accomplishes this by first creating and clearing a cudaDeviceProp structure and then by setting its major version to 1 and minor version to 0. It passes this information to cudaChooseDevice(), which instructs the runtime to select a GPU in the system that satisfies the constraints specified by the cudaDeviceProp structure. In the next chapter, we will look more at what is meant by a GPU’s compute capability, but for now it suffices to say that it roughly indicates the features a GPU supports. All CUDA-capable GPUs have at least compute capability 1.0, so the net effect of this call is that the runtime will select any CUDA-capable device and return an identifier for this device in the variable dev. There is no guarantee that this device is the best or fastest GPU, nor is there a guarantee that the device will be the same GPU from version to version of the CUDA runtime.
If the result of device selection is so seemingly underwhelming, why do we bother with all this effort to fill a cudaDeviceProp structure and call cudaChooseDevice() to get a valid device ID? Furthermore, we never hassled with this tomfoolery before, so why now? These are good questions. It turns out that we need to know the CUDA device ID so that we can tell the CUDA runtime that we intend to use the device for CUDA and OpenGL. We achieve this with a call to cudaGLSetGLDevice(), passing the device ID dev we obtained from cudaChooseDevice():
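The call itself is a one-liner, assuming dev holds the ID returned by cudaChooseDevice():

```cuda
// Tell the runtime that 'dev' will be shared between CUDA and OpenGL
HANDLE_ERROR( cudaGLSetGLDevice( dev ) );
```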
After the CUDA runtime initialization, we can proceed to initialize the OpenGL driver by calling our GL Utility Toolkit (GLUT) setup functions. This sequence of calls should look relatively familiar if you’ve used GLUT before:
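A sketch of that GLUT setup sequence, assuming argc/argv are forwarded from main():

```cuda
// Standard GLUT startup: a double-buffered RGBA window of DIM x DIM pixels
glutInit( &argc, argv );
glutInitDisplayMode( GLUT_DOUBLE | GLUT_RGBA );
glutInitWindowSize( DIM, DIM );
glutCreateWindow( "bitmap" );
```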
At this point in main(), we’ve prepared our CUDA runtime to play nicely with the OpenGL driver by calling cudaGLSetGLDevice(). Then we initialized GLUT and created a window named “bitmap” in which to draw our results. Now we can get on to the actual OpenGL interoperation!
Shared data buffers are the key component to interoperation between CUDA C kernels and OpenGL rendering. To pass data between OpenGL and CUDA, we will first need to create a buffer that can be used with both APIs. We start this process by creating a pixel buffer object in OpenGL and storing the handle in our global variable GLuint bufferObj:
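A sketch of the three-step PBO creation the next paragraph walks through (DIM * DIM * 4 bytes, one 32-bit value per pixel):

```cuda
GLuint  bufferObj;   // global handle to the pixel buffer object

glGenBuffers( 1, &bufferObj );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, DIM * DIM * 4,
              NULL, GL_DYNAMIC_DRAW_ARB );
```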
If you have never used a pixel buffer object (PBO) in OpenGL, you will typically create one with these three steps: First, we generate a buffer handle with glGenBuffers(). Then, we bind the handle to a pixel buffer with glBindBuffer(). Finally, we request the OpenGL driver to allocate a buffer for us with glBufferData(). In this example, we request a buffer to hold DIM x DIM 32-bit values and use the enumerant GL_DYNAMIC_DRAW_ARB to indicate that the buffer will be modified repeatedly by the application. Since we have no data to preload the buffer with, we pass NULL as the penultimate argument to glBufferData().
All that remains in our quest to set up graphics interoperability is notifying the CUDA runtime that we intend to share the OpenGL buffer named bufferObj with CUDA. We do this by registering bufferObj with the CUDA runtime as a graphics resource.
We specify to the CUDA runtime that we intend to use the OpenGL PBO bufferObj with both OpenGL and CUDA by calling cudaGraphicsGLRegisterBuffer(). The CUDA runtime returns a CUDA-friendly handle to the buffer in the variable resource. This handle will be used to refer to bufferObj in subsequent calls to the CUDA runtime.
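The registration call might look like this sketch:

```cuda
cudaGraphicsResource *resource;   // CUDA's handle to the same buffer

HANDLE_ERROR(
    cudaGraphicsGLRegisterBuffer( &resource, bufferObj,
                                  cudaGraphicsMapFlagsNone ) );
```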
The flag cudaGraphicsMapFlagsNone specifies that there is no particular behavior of this buffer that we want to specify, although we have the option to specify with cudaGraphicsMapFlagsReadOnly that the buffer will be read-only. We could also use cudaGraphicsMapFlagsWriteDiscard to specify that the previous contents will be discarded, making the buffer essentially write-only. These flags allow the CUDA and OpenGL drivers to optimize the hardware settings for buffers with restricted access patterns, although they are not required to be set.
Effectively, the call to glBufferData() requests the OpenGL driver to allocate a buffer large enough to hold DIM x DIM 32-bit values. In subsequent OpenGL calls, we’ll refer to this buffer with the handle bufferObj, while in CUDA runtime calls, we’ll refer to this buffer with the pointer resource. Since we would like to read from and write to this buffer from our CUDA C kernels, we will need more than just a handle to the object. We will need an actual address in device memory that can be passed to our kernel. We achieve this by instructing the CUDA runtime to map the shared resource and then by requesting a pointer to the mapped resource.
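The map-and-retrieve sequence can be sketched as:

```cuda
uchar4 *devPtr;
size_t  size;

// Map the shared resource, then ask for a device pointer
// that can be passed to a kernel
HANDLE_ERROR( cudaGraphicsMapResources( 1, &resource, NULL ) );
HANDLE_ERROR(
    cudaGraphicsResourceGetMappedPointer( (void**)&devPtr,
                                          &size, resource ) );
```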
We can then use devPtr as we would use any device pointer, except that the data can also be used by OpenGL as a pixel source. After all these setup shenanigans, the rest of main() proceeds as follows: First, we launch our kernel, passing it the pointer to our shared buffer. This kernel, the code of which we have not seen yet, generates image data to be rendered. Next, we unmap our shared resource. This call is important to make prior to performing rendering tasks because it provides synchronization between the CUDA and graphics portions of the application. Specifically, it implies that all CUDA operations performed prior to the call to cudaGraphicsUnmapResources() will complete before ensuing graphics calls begin.
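A sketch of the launch-then-unmap sequence; the 16x16 block shape is the one used throughout the book's image examples:

```cuda
dim3    grids(DIM/16, DIM/16);
dim3    threads(16, 16);
kernel<<<grids,threads>>>( devPtr );

// Unmapping synchronizes: all prior CUDA work on the buffer
// completes before OpenGL touches it
HANDLE_ERROR( cudaGraphicsUnmapResources( 1, &resource, NULL ) );
```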
Lastly, we register our keyboard and display callback functions with GLUT (key_func and draw_func), and we relinquish control to the GLUT rendering loop with glutMainLoop().
The remainder of the application consists of the three functions we just highlighted, kernel(), key_func(), and draw_func(). So, let’s take a look at those.
The kernel function takes a device pointer and generates image data. In the following example, we’re using a kernel inspired by the ripple example in Chapter 5:
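A sketch of such a kernel; the exact color computation is illustrative, but the index arithmetic and the uchar4 writes follow the pattern the next paragraph discusses:

```cuda
__global__ void kernel( uchar4 *ptr ) {
    // map the thread/block indices to a pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    // a fairly arbitrary, ripple-like green intensity
    float fx = x/(float)DIM - 0.5f;
    float fy = y/(float)DIM - 0.5f;
    unsigned char green =
        128 + 127 * sinf( fabsf(fx*100) - fabsf(fy*100) );

    // four explicit components: red, green, blue, alpha
    ptr[offset].x = 0;
    ptr[offset].y = green;
    ptr[offset].z = 0;
    ptr[offset].w = 255;
}
```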
Many familiar concepts are at work here. The method for turning thread and block indices into x- and y-coordinates and a linear offset has been examined several times. We then perform some reasonably arbitrary computations to determine the color for the pixel at that (x,y) location, and we store those values to memory. We’re again using CUDA C to procedurally generate an image on the GPU. The important thing to realize is that this image will then be handed directly to OpenGL for rendering without the CPU ever getting involved. On the other hand, in the ripple example of Chapter 5, we generated image data on the GPU very much like this, but our application then copied the buffer back to the CPU for display.
So, how do we draw the CUDA-generated buffer using OpenGL? Well, if you recall the setup we performed in main(), you’ll remember the following:
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
This call bound the shared buffer as a pixel source for the OpenGL driver to use in all subsequent calls to glDrawPixels(). Essentially, this means that a call to glDrawPixels() is all that we need in order to render the image data our CUDA C kernel generated. Consequently, the following is all that our draw_func() needs to do:
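A sketch of that draw_func():

```cuda
static void draw_func( void ) {
    // the final argument is an offset into the bound PBO,
    // not a host pointer
    glDrawPixels( DIM, DIM, GL_RGBA, GL_UNSIGNED_BYTE, 0 );
    glutSwapBuffers();
}
```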
It’s possible you’ve seen glDrawPixels() with a buffer pointer as the last argument. The OpenGL driver will copy from this buffer if no buffer is bound as a GL_PIXEL_UNPACK_BUFFER_ARB source. However, since our data is already on the GPU and we have bound our shared buffer as the GL_PIXEL_UNPACK_BUFFER_ARB source, this last parameter instead becomes an offset into the bound buffer. Because we want to render the entire buffer, this offset is zero for our application.
The last component to this example seems somewhat anticlimactic, but we’ve decided to give our users a method to exit the application. In this vein, our key_func() callback responds only to the Esc key and uses this as a signal to clean up and exit:
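A sketch of the cleanup-on-Esc callback; key code 27 is the ASCII value of the Esc key:

```cuda
static void key_func( unsigned char key, int x, int y ) {
    switch (key) {
        case 27:    // Esc: unregister from CUDA, free the PBO, and exit
            HANDLE_ERROR( cudaGraphicsUnregisterResource( resource ) );
            glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, 0 );
            glDeleteBuffers( 1, &bufferObj );
            exit(0);
    }
}
```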
Figure 8.1 A screenshot of the hypnotic graphics interoperation example
When run, this example draws a mesmerizing picture in “NVIDIA Green” and black, shown in Figure 8.1. Try using it to hypnotize your friends (or enemies).
In “Section 8.2: Graphics Interoperation,” we referred to Chapter 5’s GPU ripple example a few times. If you recall, that application created a CPUAnimBitmap and passed it a function to be called whenever a frame needed to be generated.
With the techniques we’ve learned in the previous section, we intend to create a GPUAnimBitmap structure. This structure will serve the same purpose as the CPUAnimBitmap, but in this improved version, the CUDA and OpenGL components will cooperate without CPU intervention. When we’re done, the application will use a GPUAnimBitmap so that main() will become simply as follows:
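A sketch of that simplified main(), assuming the GPUAnimBitmap exposes an anim_and_exit() entry point analogous to CPUAnimBitmap's:

```cuda
int main( void ) {
    GPUAnimBitmap bitmap( DIM, DIM, NULL );

    bitmap.anim_and_exit(
        (void (*)(uchar4*,void*,int))generate_frame, NULL );
}
```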
The GPUAnimBitmap structure uses the same calls we just examined in Section 8.2: Graphics Interoperation. However, now these calls will be abstracted away in a GPUAnimBitmap structure so that future examples (and potentially your own applications) will be cleaner.
Several of the data members for our GPUAnimBitmap will look familiar to you from Section 8.2: Graphics Interoperation.
We know that OpenGL and the CUDA runtime will have different names for our GPU buffer, and we know that we will need to refer to both of these names, depending on whether we are making OpenGL or CUDA C calls. Therefore, our structure will store both OpenGL’s bufferObj name and the CUDA runtime’s resource name. Since we are dealing with a bitmap image that we intend to display, we know that the image will have a width and height to it.
To allow users of our GPUAnimBitmap to register for certain callback events, we will also store a void* pointer to arbitrary user data in dataBlock. Our structure will never look at this data but will simply pass it back to any registered callback functions. The callbacks that a user may register are stored in fAnim, animExit, and clickDrag. The function fAnim() gets called in every call to glutIdleFunc(), and this function is responsible for producing the image data that will be rendered in the animation. The function animExit() will be called once, when the animation exits. This is where the user should implement cleanup code that needs to be executed when the animation ends. Finally, clickDrag(), an optional function, implements the user’s response to mouse click/drag events. If the user registers this function, it gets called after every sequence of mouse button press, drag, and release events. The location of the initial mouse click in this sequence is stored in (dragStartX, dragStartY) so that the start and endpoints of the click/drag event can be passed to the user when the mouse button is released. This can be used to implement interactive animations that will impress your friends.
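Gathering the members described in the last two paragraphs, the data portion of the structure might look like this sketch:

```cuda
struct GPUAnimBitmap {
    GLuint                bufferObj;   // OpenGL's name for the buffer
    cudaGraphicsResource *resource;    // CUDA's name for the same buffer
    int                   width, height;

    void *dataBlock;                   // opaque user data for callbacks
    void (*fAnim)( uchar4*, void*, int );
    void (*animExit)( void* );
    void (*clickDrag)( void*, int, int, int, int );
    int   dragStartX, dragStartY;      // where the click/drag began

    // constructor and GLUT plumbing omitted
};
```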
Initializing a GPUAnimBitmap follows the same sequence of code that we saw in our previous example. After stashing away arguments in the appropriate structure members, we start by querying the CUDA runtime for a suitable CUDA device:
After finding a compatible CUDA device, we make the important cudaGLSetGLDevice() call to the CUDA runtime in order to notify it that we intend to use dev as a device for interoperation with OpenGL:
Since our framework uses GLUT to create a windowed rendering environment, we need to initialize GLUT. This is unfortunately a bit awkward, since glutInit() wants command-line arguments to pass to the windowing system. Since we have none we want to pass, we would like to simply specify zero command-line arguments. Unfortunately, some versions of GLUT have a bug that causes applications to crash when zero arguments are given. So, we trick GLUT into thinking that we’re passing an argument, and as a result, life is good.
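The trick amounts to fabricating a one-element argument list, roughly:

```cuda
// GLUT workaround: pretend there is a single, empty
// command-line argument
int   c = 1;
char *dummy = "";
glutInit( &c, &dummy );
```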
We continue initializing GLUT exactly as we did in the previous example. We create a window in which to render, specifying a title with the string “bitmap.” If you’d like to name your window something more interesting, be our guest.
Next, we request for the OpenGL driver to allocate a buffer handle that we immediately bind to the GL_PIXEL_UNPACK_BUFFER_ARB target to ensure that future calls to glDrawPixels() will draw to our interop buffer:
Last, but most certainly not least, we request that the OpenGL driver allocate a region of GPU memory for us. Once this is done, we inform the CUDA runtime of this buffer and request a CUDA C name for this buffer by registering bufferObj with cudaGraphicsGLRegisterBuffer().
With the GPUAnimBitmap set up, the only remaining concern is exactly how we perform the rendering. The meat of the rendering will be done in our glutIdleFunc(). This function will essentially do three things. First, it maps our shared buffer and retrieves a GPU pointer for this buffer.
Second, it calls the user-specified function fAnim() that presumably will launch a CUDA C kernel to fill the buffer at devPtr with image data.
And lastly, it unmaps the GPU pointer that will release the buffer for use by the OpenGL driver in rendering. This rendering will be triggered by a call to glutPostRedisplay().
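The three steps can be sketched together as follows; get_bitmap_ptr() is a hypothetical accessor standing in for however the structure retrieves its singleton instance inside a GLUT callback:

```cuda
// map the buffer, call the user's animation function,
// unmap, then trigger a redraw
static void idle_func( void ) {
    static int ticks = 1;
    GPUAnimBitmap *bitmap = get_bitmap_ptr();  // hypothetical accessor

    uchar4 *devPtr;
    size_t  size;

    HANDLE_ERROR(
        cudaGraphicsMapResources( 1, &(bitmap->resource), NULL ) );
    HANDLE_ERROR(
        cudaGraphicsResourceGetMappedPointer( (void**)&devPtr, &size,
                                              bitmap->resource ) );

    bitmap->fAnim( devPtr, bitmap->dataBlock, ticks++ );

    HANDLE_ERROR(
        cudaGraphicsUnmapResources( 1, &(bitmap->resource), NULL ) );

    glutPostRedisplay();
}
```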
The remainder of the GPUAnimBitmap structure consists of important but somewhat tangential infrastructure code. If you have an interest in it, you should by all means examine it. But we feel that you’ll be able to proceed successfully, even if you lack the time or interest to digest the rest of the code in GPUAnimBitmap.
Now that we have a GPU version of CPUAnimBitmap, we can proceed to retrofit our GPU ripple application to perform its animation entirely on the GPU. To begin, we will include gpu_anim.h, the home of our implementation of GPUAnimBitmap. We also include nearly the same kernel as we examined in Chapter 5.
The one and only change we’ve made is highlighted. The reason for this change is because OpenGL interoperation requires that our shared surfaces be “graphics friendly.” Because real-time rendering typically uses arrays of four-component (red/green/blue/alpha) data elements, our target buffer is no longer simply an array of unsigned char as it previously was. It’s now required to be an array of type uchar4. In reality, we treated our buffer in Chapter 5 as a four-component buffer, so we always indexed it with ptr[offset*4+k], where k indicates the component from 0 to 3. But now, the four-component nature of the data is made explicit with the switch to a uchar4 type.
Since kernel() is a CUDA C function that generates image data, all that remains is writing a host function that will be used as a callback in the idle_func() member of GPUAnimBitmap. For our current application, all this function does is launch the CUDA C kernel:
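A sketch of that host callback, matching the fAnim signature described earlier:

```cuda
void generate_frame( uchar4 *pixels, void*, int ticks ) {
    dim3    grids(DIM/16, DIM/16);
    dim3    threads(16, 16);
    kernel<<<grids,threads>>>( pixels, ticks );
}
```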
That’s basically everything we need, since all of the heavy lifting was done in the GPUAnimBitmap structure. To get this party started, we just create a GPUAnimBitmap and register our animation callback function, generate_frame().
So, what has been the point of doing all of this? If you look at the internals of CPUAnimBitmap, the structure we used for previous animation examples, you will see that it works almost exactly like the rendering code in Section 8.2: Graphics Interoperation.
Almost.
The key difference between the CPUAnimBitmap version and the previous example is buried in CPUAnimBitmap's call to glDrawPixels().
We remarked in the first example of this chapter that you may have previously seen calls to glDrawPixels() with a buffer pointer as the last argument. Well, if you hadn’t before, you have now. This call in the Draw() routine of CPUAnimBitmap triggers a copy of the CPU buffer in bitmap->pixels to the GPU for rendering. To do this, the CPU needs to stop what it’s doing and initiate a copy onto the GPU for every frame. This requires synchronization between the CPU and GPU and additional latency to initiate and complete a transfer over the PCI Express bus. Since the call to glDrawPixels() expects a host pointer in the last argument, this also means that after generating a frame of image data with a CUDA C kernel, our Chapter 5 ripple application needed to copy the frame from the GPU to the CPU with a cudaMemcpy().
Taken together, these facts mean that our original GPU ripple application was more than a little silly. We used CUDA C to compute image values for our rendering in each frame, but after the computations were done, we copied the buffer to the CPU, which then copied the buffer back to the GPU for display. This means that we introduced unnecessary data transfers between the host and the device that stood between us and maximum performance. Let’s revisit a compute-intensive animation application that might see its performance improve by migrating it to use graphics interoperation for its rendering.
If you recall the previous chapter’s heat simulation application, you will remember that it also used CPUAnimBitmap in order to display the output of its simulation computations. We will modify this application to use our newly implemented GPUAnimBitmap structure and look at how the resulting performance changes. As with the ripple example, our GPUAnimBitmap is almost a perfect drop-in replacement for CPUAnimBitmap, with the exception of the unsigned char to uchar4 change. So, the signature of our animation routine changes in order to accommodate this shift in data types.
Since the float_to_color() kernel is the only function that actually uses the outputBitmap, it’s the only other function that needs modification as a result of our shift to uchar4. This function was simply considered utility code in the previous chapter, and we will continue to consider it utility code. However, we have overloaded this function and included both unsigned char and uchar4 versions in book.h. You will notice that the differences between these functions are identical to the differences between kernel() in the CPU-animated and GPU-animated versions of GPU ripple. Most of the code for the float_to_color() kernels has been omitted for clarity, but we encourage you to consult book.h if you’re dying to see the details.
Outside of these changes, the only major difference is in the change from CPUAnimBitmap to GPUAnimBitmap to perform animation.
Although it might be instructive to take a glance at the rest of this enhanced heat simulation application, it is not sufficiently different from the previous chapter’s version to warrant more description. The important component is answering the question, how does performance change now that we’ve completely migrated the application to the GPU? Without having to copy every frame back to the host for display, the situation should be much happier than it was previously.
So, exactly how much better is it to use the graphics interoperability to perform the rendering? Previously, the heat transfer example consumed about 25.3ms per frame on our GeForce GTX 285–based test machine. After converting the application to use graphics interoperability, this drops by 15 percent to 21.6ms per frame. The net result is that our rendering loop is 15 percent faster and no longer requires intervention from the host every time we want to display a frame. That’s not bad for a day’s work!
Although we’ve looked only at examples that use interoperation with the OpenGL rendering system, DirectX interoperation is nearly identical. You will still use a cudaGraphicsResource to refer to buffers that you share between DirectX and CUDA, and you will still use calls to cudaGraphicsMapResources() and cudaGraphicsResourceGetMappedPointer() to retrieve CUDA-friendly pointers to these shared resources.
For the most part, the calls that differ between OpenGL and DirectX interoperability have embarrassingly simple translations to DirectX. For example, rather than calling cudaGLSetGLDevice(), we call cudaD3D9SetDirect3DDevice() to specify that a CUDA device should be enabled for Direct3D 9.0 interoperability. Likewise, cudaD3D10SetDirect3DDevice() enables a device for Direct3D 10 interoperation and cudaD3D11SetDirect3DDevice() for Direct3D 11.
The details of DirectX interoperability probably will not surprise you if you’ve worked through this chapter’s OpenGL examples. But if you want to use DirectX interoperation and want a small project to get started, we suggest that you migrate this chapter’s examples to use DirectX. To get started, we recommend consulting the NVIDIA CUDA Programming Guide for a reference on the API and taking a look at the GPU Computing SDK code samples on DirectX interoperability.
Although much of this book has been devoted to using the GPU for parallel, general-purpose computing, we can’t forget the GPU’s successful day job as a rendering engine. Many applications require or would benefit from the use of standard computer graphics rendering. Since the GPU is master of the rendering domain, all that stood between us and the exploitation of these resources was a lack of understanding of the mechanics in convincing the CUDA runtime and graphics drivers to cooperate. Now that we have seen how this is done, we no longer need the host to intervene in displaying the graphical results of our computations. This simultaneously accelerates the application’s rendering loop and frees the host to perform other computations in the meantime. Otherwise, if there are no other computations to be performed, it leaves our system more responsive to other events or applications.
There are many other ways to use graphics interoperability that we left unexplored. We looked primarily at using a CUDA C kernel to write into a pixel buffer object for display in a window. This image data can also be used as a texture that can be applied to any surface in the scene. In addition to modifying pixel buffer objects, you can also share vertex buffer objects between CUDA and the graphics engine. Among other things, this allows you to write CUDA C kernels that perform collision detection between objects or compute vertex displacement maps to be used to render objects or surfaces that interact with the user or their surroundings. If you’re interested in computer graphics, CUDA C’s graphics interoperability API enables a slew of new possibilities for your applications!
In the first half of the book, we saw many occasions where something complicated to accomplish with a single-threaded application becomes quite easy when implemented using CUDA C. For example, thanks to the behind-the-scenes work of the CUDA runtime, we no longer needed for() loops in order to do per-pixel updates in our animations or heat simulations. Likewise, thousands of parallel blocks and threads get created and automatically enumerated with thread and block indices simply by calling a __global__ function from host code.
On the other hand, there are some situations where something incredibly simple in single-threaded applications actually presents a serious problem when we try to implement the same algorithm on a massively parallel architecture. In this chapter, we’ll take a look at some of the situations where we need to use special primitives in order to safely accomplish things that can be quite trivial to do in a traditional, single-threaded application.
Through the course of this chapter, you will accomplish the following:
• You will learn about the compute capability of various NVIDIA GPUs.
• You will learn about what atomic operations are and why you might need them.
• You will learn how to perform arithmetic with atomic operations in your CUDA C kernels.
All of the topics we have covered to this point involve capabilities that every CUDA-enabled GPU possesses. For example, every GPU built on the CUDA Architecture can launch kernels, access global memory, and read from constant and texture memories. But just like different models of CPUs have varying capabilities and instruction sets (for example, MMX, SSE, or SSE2), so too do CUDA-enabled graphics processors. NVIDIA refers to the supported features of a GPU as its compute capability.
As of press time, NVIDIA GPUs could potentially support compute capabilities 1.0, 1.1, 1.2, 1.3, or 2.0. Higher-capability versions represent supersets of the versions below them, implementing a “layered onion” or “Russian nesting doll” hierarchy (depending on your metaphorical preference). For example, a GPU with compute capability 1.2 supports all the features of compute capabilities 1.0 and 1.1. The NVIDIA CUDA Programming Guide contains an up-to-date list of all CUDA-capable GPUs and their corresponding compute capability. Table 9.1 lists the NVIDIA GPUs available at press time. The compute capability supported by each GPU is listed next to the device’s name.
Table 9.1 Selected CUDA-Enabled GPUs and Their Corresponding Compute Capabilities
Of course, since NVIDIA releases new graphics processors all the time, this table will undoubtedly be out-of-date the moment this book is published. Fortunately, NVIDIA has a website, and on this website you will find the CUDA Zone. Among other things, the CUDA Zone is home to the most up-to-date list of supported CUDA devices. We recommend that you consult this list before doing anything drastic as a result of being unable to find your new GPU in Table 9.1. Or you can simply run the example from Chapter 3 that prints the compute capability of each CUDA device in the system.
Because this is the chapter on atomics, of particular relevance is the hardware capability to perform atomic operations on memory. Before we look at what atomic operations are and why you care, you should know that atomic operations on global memory are supported only on GPUs of compute capability 1.1 or higher. Furthermore, atomic operations on shared memory require a GPU of compute capability 1.2 or higher. Because of the superset nature of compute capability versions, GPUs of compute capability 1.2 therefore support both shared memory atomics and global memory atomics. Similarly, GPUs of compute capability 1.3 support both of these as well.
If it turns out that your GPU is of compute capability 1.0 and it doesn’t support atomic operations on global memory, well maybe we’ve just given you the perfect excuse to upgrade! If you decide you’re not ready to splurge on a new atomics-enabled graphics processor, you can continue to read about atomic operations and the situations in which you might want to use them. But if you find it too heartbreaking that you won’t be able to run the examples, feel free to skip to the next chapter.
Suppose that we have written code that requires a certain minimum compute capability. For example, imagine that you’ve finished this chapter and go off to write an application that relies heavily on global memory atomics. Having studied this text extensively, you know that global memory atomics require a compute capability of 1.1. To compile your code, you need to inform the compiler that the kernel cannot run on hardware with a capability less than 1.1. Moreover, in telling the compiler this, you’re also giving it the freedom to make other optimizations that may be available only on GPUs of compute capability 1.1 or greater. Informing the compiler of this is as simple as adding a command-line option to your invocation of nvcc:
nvcc -arch=sm_11
Similarly, to build a kernel that relies on shared memory atomics, you need to inform the compiler that the code requires compute capability 1.2 or greater:
nvcc -arch=sm_12
Programmers typically never need to use atomic operations when writing traditional single-threaded applications. If that describes you, don’t worry; we plan to explain what they are and why we might need them in a multithreaded application. To clarify atomic operations, we’ll look at one of the first things you learned when learning C or C++: the increment operator:
x++;
This is a single expression in standard C, and after executing this expression, the value in x should be one greater than it was prior to executing the increment. But what sequence of operations does this imply? To add one to the value of x, we first need to know what value is currently in x. After reading the value of x, we can modify it. And finally, we need to write this value back to x.
So the three steps in this operation are as follows:
1. Read the value in x.
2. Add 1 to the value read in step 1.
3. Write the result back to x.
Sometimes, this process is generally called a read-modify-write operation, since step 2 can consist of any operation that changes the value that was read from x.
Now consider a situation where two threads need to perform this increment on the value in x. Let’s call these threads A and B. For A and B to both increment the value in x, both threads need to perform the three operations we’ve described. Let’s suppose x starts with the value 7. Ideally we would like thread A and thread B to do the steps shown in Table 9.2.
Table 9.2 Two threads incrementing the value in x

    Step                     Value of x
    Thread A reads x.        7
    Thread A adds 1.         7
    Thread A writes x.       8
    Thread B reads x.        8
    Thread B adds 1.         8
    Thread B writes x.       9
Since x starts with the value 7 and gets incremented by two threads, we would expect it to hold the value 9 after they’ve completed. In the previous sequence of operations, this is indeed the result we obtain. Unfortunately, there are many other orderings of these steps that produce the wrong value. For example, consider the ordering shown in Table 9.3 where thread A and thread B’s operations become interleaved with each other.
Table 9.3 Two threads incrementing the value in x with interleaved operations

    Step                     Value of x
    Thread A reads x.        7
    Thread B reads x.        7
    Thread A adds 1.         7
    Thread B adds 1.         7
    Thread A writes x.       8
    Thread B writes x.       8
Therefore, if our threads get scheduled unfavorably, we end up computing the wrong result. There are many other orderings for these six operations, some of which produce correct results and some of which do not. When moving from a single-threaded to a multithreaded version of this application, we suddenly have the potential for unpredictable results if multiple threads need to read or write shared values.
In the previous example, we need a way to perform the read-modify-write without being interrupted by another thread. Or more specifically, no other thread can read or write the value of x until we have completed our operation. Because the execution of these operations cannot be broken into smaller parts by other threads, we call operations that satisfy this constraint atomic. CUDA C supports several atomic operations that allow you to operate safely on memory, even when thousands of threads are potentially competing for access.
Now we’ll take a look at an example that requires the use of atomic operations to compute correct results.
Oftentimes, algorithms require the computation of a histogram of some set of data. If you haven’t had any experience with histograms in the past, that’s not a big deal. Essentially, given a data set that consists of some set of elements, a histogram represents a count of the frequency of each element. For example, if we created a histogram of the letters in the phrase Programming with CUDA C, we would end up with the result shown in Figure 9.1.
Figure 9.1 Letter frequency histogram built from the string Programming with CUDA C
Although simple to describe and understand, computing histograms of data arises surprisingly often in computer science. It’s used in algorithms for image processing, data compression, computer vision, machine learning, audio encoding, and many others. We will use histogram computation as the algorithm for the following code examples.
Because the computation of a histogram may not be familiar to all readers, we’ll start with an example of how to compute a histogram on the CPU. This example will also serve to illustrate how computing a histogram is relatively simple in a single-threaded CPU application. The application will be given some large stream of data. In an actual application, the data might signify anything from pixel colors to audio samples, but in our sample application, it will be a stream of randomly generated bytes. We can create this random stream of bytes using a utility function we have provided called big_random_block(). In our application, we create 100MB of random data.
Since each random 8-bit byte can be any of 256 different values (from 0x00 to 0xFF), our histogram needs to contain 256 bins in order to keep track of the number of times each value has been seen in the data. We create a 256-bin array and initialize all the bin counts to zero.
Once our histogram has been created and all the bins are initialized to zero, we need to tabulate the frequency with which each value appears in the data contained in buffer[]. The idea here is that whenever we see some value z in the array buffer[], we want to increment the value in bin z of our histogram. This way, we’re counting the number of times we have seen an occurrence of the value z.
If buffer[i] is the current value we are looking at, we want to increment the count we have in the bin numbered buffer[i]. Since bin buffer[i] is located at histo[buffer[i]], we can increment the appropriate counter in a single line of code.
histo[buffer[i]]++;
We do this for each element in buffer[] with a simple for() loop:
At this point, we’ve completed our histogram of the input data. In a full application, this histogram might be the input to the next step of computation. In our simple example, however, this is all we care to compute, so we end the application by verifying that all the bins of our histogram sum to the expected value.
If you’ve followed closely, you will realize that this sum will always be the same, regardless of the random input array. Each bin counts the number of times we have seen the corresponding data element, so the sum of all of these bins should be the total number of data elements we’ve examined. In our case, this will be the value SIZE.
And needless to say (but we will anyway), we clean up after ourselves and return.
On our benchmark machine, a Core 2 Duo, the histogram of this 100MB array of data can be constructed in 0.416 seconds. This will provide a baseline performance for the GPU version we intend to write.
We would like to adapt the histogram computation example to run on the GPU. If our input array is large enough, it might save a considerable amount of time to have different threads examining different parts of the buffer. Having different threads read different parts of the input should be easy enough. After all, it’s very similar to things we have seen so far. The problem with computing a histogram from the input data arises from the fact that multiple threads may want to increment the same bin of the output histogram at the same time. In this situation, we will need to use atomic increments to avoid a situation like the one described in Section 9.3: Atomic Operations Overview.
Our main() routine looks very similar to the CPU version, although we will need to add some of the CUDA C plumbing in order to get input to the GPU and results from the GPU. However, we start exactly as we did on the CPU:
We will be interested in measuring how our code performs, so we initialize events for timing exactly like we always have.
After setting up our input data and events, we look to GPU memory. We will need to allocate space for our random input data and our output histogram. After allocating the input buffer, we copy the array we generated with big_random_block() to the GPU. Likewise, after allocating the histogram, we initialize it to zero just like we did in the CPU version.
You may notice that we slipped in a new CUDA runtime function, cudaMemset(). This function has a similar signature to the standard C function memset(), and the two functions behave nearly identically. The difference in signature between these functions is that cudaMemset() returns an error code while the C library function memset() does not. This error code will inform the caller whether anything bad happened while attempting to set GPU memory. Aside from the error code return, the only difference is that cudaMemset() operates on GPU memory while memset() operates on host memory.
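Zeroing the device-side histogram with cudaMemset() might look like the following sketch; dev_histo and the HANDLE_ERROR error-checking macro are assumed names here, not part of the text above:

```c
/* zero all 256 bins of the GPU-resident histogram */
HANDLE_ERROR( cudaMemset( dev_histo, 0, 256 * sizeof( unsigned int ) ) );
```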
After initializing the input and output buffers, we are ready to compute our histogram. You will see how we prepare and launch the histogram kernel momentarily. For the time being, assume that we have computed the histogram on the GPU. After finishing, we need to copy the histogram back to the CPU, so we allocate a 256-entry array and perform a copy from device to host.
At this point, we are done with the histogram computation so we can stop our timers and display the elapsed time. Just like the previous event code, this is identical to the timing code we’ve used for several chapters.
At this point, we could pass the histogram as input to another stage in the algorithm, but since we are not using the histogram for anything else, we will simply verify that the computed GPU histogram matches what we get on the CPU. First, we verify that the histogram sum matches what we expect. This is identical to the CPU code shown here:
To fully verify the GPU histogram, though, we will use the CPU to compute the same histogram. The obvious way to do this would be to allocate a new histogram array, compute a histogram from the input using the code from Section 9.4.1: CPU Histogram Computation, and, finally, ensure that each bin in the GPU and CPU version match. But rather than allocate a new histogram array, we’ll opt to start with the GPU histogram and compute the CPU histogram “in reverse.”
By computing the histogram “in reverse,” we mean that rather than starting at zero and incrementing bin values when we see data elements, we will start with the GPU histogram and decrement the bin’s value when the CPU sees data elements. Therefore, the CPU has computed the same histogram as the GPU if and only if every bin has the value zero when we are finished. In some sense, we are computing the difference between these two histograms. The code will look remarkably like the CPU histogram computation but with a decrement operator instead of an increment operator.
As usual, the finale involves cleaning up our allocated CUDA events, GPU memory, and host memory.
Before, we assumed that we had launched a kernel that computed our histogram and then pressed on to discuss the aftermath. Our kernel launch is slightly more complicated than usual because of performance concerns. Because the histogram contains 256 bins, using 256 threads per block proves convenient and results in high performance. But we have a lot of flexibility in terms of the number of blocks we launch. For example, with 100MB of data, we have 104,857,600 bytes of data. We could launch a single block and have each thread examine 409,600 data elements. Likewise, we could launch 409,600 blocks and have each thread examine a single data element.
As you might have guessed, the optimal solution lies somewhere between these two extremes. Our performance experiments show that optimal performance is achieved when the number of blocks we launch is exactly twice the number of multiprocessors our GPU contains. For example, a GeForce GTX 280 has 30 multiprocessors, so our histogram kernel happens to run fastest on a GeForce GTX 280 when launched with 60 parallel blocks.
In Chapter 3, we discussed a method for querying various properties of the hardware on which our program is running. We will need to use one of these device properties if we intend to dynamically size our launch based on our current hardware platform. To accomplish this, we will use the following code segment. Although you haven’t yet seen the kernel implementation, you should still be able to follow what is going on.
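Such a code segment could look like the following sketch; histo_kernel, dev_buffer, dev_histo, SIZE, and the HANDLE_ERROR error-checking macro are assumed names for things defined elsewhere in the program:

```c
cudaDeviceProp prop;
HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );

/* twice the multiprocessor count, per the experiments above */
int blocks = prop.multiProcessorCount * 2;
histo_kernel<<<blocks,256>>>( dev_buffer, SIZE, dev_histo );
```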
Since our walk-through of main() has been somewhat fragmented, here is the entire routine from start to finish:
And now for the fun part: the GPU code that computes the histogram! The kernel that computes the histogram itself needs to be given a pointer to the input data array, the length of the input array, and a pointer to the output histogram. The first thing our kernel needs to compute is a linearized offset into the input data array. Each thread will start with an offset between 0 and the number of threads minus 1. It will then stride by the total number of threads that have been launched. We hope you remember this technique; we used the same logic to add vectors of arbitrary length when you first learned about threads.
Once each thread knows its starting offset i and the stride it should use, the code walks through the input array incrementing the corresponding histogram bin.
The highlighted line represents the way we use atomic operations in CUDA C. The call atomicAdd( addr, y ); generates an atomic sequence of operations that reads the value at address addr, adds y to that value, and stores the result back to the memory address addr. The hardware guarantees us that no other thread can read or write the value at address addr while we perform these operations, thus ensuring predictable results. In our example, the address in question is the location of the histogram bin that corresponds to the current byte. If the current byte is buffer[i], just like we saw in the CPU version, the corresponding histogram bin is histo[buffer[i]]. The atomic operation needs the address of this bin, so the first argument is therefore &(histo[buffer[i]]). Since we simply want to increment the value in that bin by one, the second argument is 1.
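Putting the offset-and-stride loop and the atomicAdd() call together, a kernel along these lines emerges (our reconstruction, consistent with the description; not necessarily the book's exact listing):

```c
__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        /* atomic read-modify-write on the bin for this byte */
        atomicAdd( &(histo[buffer[i]]), 1 );
        i += stride;
    }
}
```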
So after all that hullabaloo, our GPU histogram computation is fairly similar to the corresponding CPU version.
However, we need to save the celebrations for later. After running this example, we discover that a GeForce GTX 285 can construct a histogram from 100MB of input data in 1.752 seconds. If you read the section on CPU-based histograms, you will realize that this performance is terrible. In fact, this is more than four times slower than the CPU version! But this is why we always measure our baseline performance. It would be a shame to settle for such a low-performance implementation simply because it runs on the GPU.
Since we do very little work in the kernel, it is quite likely that the atomic operation on global memory is causing the problem. Essentially, when thousands of threads are trying to access a handful of memory locations, a great deal of contention for our 256 histogram bins can occur. To ensure atomicity of the increment operations, the hardware needs to serialize operations to the same memory location. This can result in a long queue of pending operations, and any performance gain we might have had will vanish. We will need to improve the algorithm itself in order to recover this performance.
Ironically, despite the fact that the atomic operations cause this performance degradation, alleviating the slowdown actually involves using more atomics, not fewer. The core problem was not the use of atomics so much as the fact that thousands of threads were competing for access to a relatively small number of memory addresses. To address this issue, we will split our histogram computation into two phases.
In phase one, each parallel block will compute a separate histogram of the data that its constituent threads examine. Since each block does this independently, we can compute these histograms in shared memory, saving us the time of sending each write off-chip to DRAM. Doing this does not free us from needing atomic operations, though, since multiple threads within the block can still examine data elements with the same value. However, the fact that only 256 threads will now be competing for 256 addresses will reduce contention from the global version where thousands of threads were competing.
The first phase then involves allocating and zeroing a shared memory buffer to hold each block’s intermediate histogram. Recall from Chapter 5 that since the subsequent step will involve reading and modifying this buffer, we need a __syncthreads() call to ensure that every thread’s write has completed before progressing.
After zeroing the histogram, the next step is remarkably similar to our original GPU histogram. The sole differences here are that we use the shared memory buffer temp[] instead of the global memory buffer histo[] and that we need a subsequent call to __syncthreads() to ensure the last of our writes have been committed.
The last step in our modified histogram example requires that we merge each block’s temporary histogram into the global buffer histo[]. Suppose we split the input in half and two threads look at different halves and compute separate histograms. If thread A sees byte 0xFC 20 times in the input and thread B sees byte 0xFC 5 times, the byte 0xFC must have appeared 25 times in the input. Likewise, each bin of the final histogram is just the sum of the corresponding bin in thread A’s histogram and thread B’s histogram. This logic extends to any number of threads, so merging every block’s histogram into a single final histogram involves adding each entry in the block’s histogram to the corresponding entry in the final histogram. For all the reasons we’ve seen already, this needs to be done atomically:
Since we have decided to use 256 threads and have 256 histogram bins, each thread atomically adds a single bin to the final histogram’s total. If these numbers didn’t match, this phase would be more complicated. Note that we have no guarantees about what order the blocks add their values to the final histogram, but since integer addition is commutative, we will always get the same answer provided that the additions occur atomically.
And with this, our two-phase histogram computation kernel is complete. Here it is from start to finish:
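A reconstruction consistent with the steps described above (the shared temp[] buffer, the two __syncthreads() calls, and the atomic merge) would look like this; a sketch, not necessarily the book's exact listing:

```c
__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    /* phase 1: each block zeroes its own shared-memory histogram */
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    /* phase 1 (cont.): tally into shared memory with shared atomics */
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    __syncthreads();

    /* phase 2: merge this block's 256 bins into the global histogram */
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
```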
This version of our histogram example improves dramatically over the previous GPU version. Adding the shared memory component drops our running time on a GeForce GTX 285 to 0.057 seconds. Not only is this significantly better than the version that used global memory atomics only, but it also beats our original CPU implementation (0.416 seconds versus 0.057 seconds), a greater than sevenfold boost in speed. So despite the early setback in adapting the histogram to a GPU implementation, our version that uses both shared and global atomics should be considered a success.
Although we have frequently spoken at length about how easy parallel programming can be with CUDA C, we have largely ignored some of the situations when massively parallel architectures such as the GPU can make our lives as programmers more difficult. Trying to cope with potentially tens of thousands of threads simultaneously modifying the same memory addresses is a common situation where a massively parallel machine can seem burdensome. Fortunately, we have hardware-supported atomic operations available to help ease this pain.
However, as you saw with the histogram computation, sometimes reliance on atomic operations introduces performance issues that can be resolved only by rethinking parts of the algorithm. In the histogram example, we moved to a two-stage algorithm that alleviated contention for global memory addresses. In general, this strategy of looking to lessen memory contention tends to work well, and you should keep it in mind when using atomics in your own applications.
Time and time again in this book we have seen how the massively data-parallel execution engine on a GPU can provide stunning performance gains over comparable CPU code. However, there is yet another class of parallelism to be exploited on NVIDIA graphics processors. This parallelism is similar to the task parallelism that is found in multithreaded CPU applications. Rather than simultaneously computing the same function on lots of data elements as one does with data parallelism, task parallelism involves doing two or more completely different tasks in parallel.
In the context of parallelism, a task could be any number of things. For example, an application could be executing two tasks: redrawing its GUI with one thread while downloading an update over the network with another thread. These tasks proceed in parallel, despite having nothing in common. Although the task parallelism on GPUs is not currently as flexible as a general-purpose processor’s, it still provides opportunities for us as programmers to extract even more speed from our GPU-based implementations. In this chapter, we will look at CUDA streams and the ways in which their careful use will enable us to execute certain operations simultaneously on the GPU.
Through the course of this chapter, you will accomplish the following:
• You will learn about allocating page-locked host memory.
• You will learn what CUDA streams are.
• You will learn how to use CUDA streams to accelerate your applications.
In every example over the course of nine chapters, you have seen us allocate memory on the GPU with cudaMalloc(). On the host, we have always allocated memory with the vanilla, C library routine malloc(). However, the CUDA runtime offers its own mechanism for allocating host memory: cudaHostAlloc(). Why would you bother using this function when malloc() has served you quite well since day one of your life as a C programmer?
In fact, there is a significant difference between the memory that malloc() will allocate and the memory that cudaHostAlloc() allocates. The C library function malloc() allocates standard, pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory. Sometimes called pinned memory, page-locked buffers have an important property: The operating system guarantees us that it will never page this memory out to disk, which ensures its residency in physical memory. The corollary to this is that it becomes safe for the OS to allow an application access to the physical address of the memory, since the buffer will not be evicted or relocated.
Knowing the physical address of a buffer, the GPU can then use direct memory access (DMA) to copy data to or from the host. Since DMA copies proceed without intervention from the CPU, it also means that the CPU could be simultaneously paging these buffers out to disk or relocating their physical address by updating the operating system’s pagetables. The possibility of the CPU moving pageable data means that using pinned memory for a DMA copy is essential. In fact, even when you attempt to perform a memory copy with pageable memory, the CUDA driver still uses DMA to transfer the buffer to the GPU. Therefore, your copy happens twice, first from a pageable system buffer to a page-locked “staging” buffer and then from the page-locked system buffer to the GPU.
As a result, whenever you perform memory copies from pageable memory, you guarantee that the copy speed will be bounded by the lower of the PCIE transfer speed and the system front-side bus speeds. A large disparity in bandwidth between these buses in some systems ensures that page-locked host memory enjoys roughly a twofold performance advantage over standard pageable memory when used for copying data between the GPU and the host. But even in a world where PCI Express and front-side bus speeds were identical, pageable buffers would still incur the overhead of an additional CPU-managed copy.
However, you should resist the temptation to simply do a search-and-replace on malloc to convert every one of your calls to use cudaHostAlloc(). Using pinned memory is a double-edged sword. By doing so, you have effectively opted out of all the nice features of virtual memory. Specifically, the computer running the application needs to have available physical memory for every page-locked buffer, since these buffers can never be swapped out to disk. This means that your system will run out of memory much faster than it would if you stuck to standard malloc() calls. Not only does this mean that your application might start to fail on machines with smaller amounts of physical memory, but it means that your application can affect the performance of other applications running on the system.
These warnings are not meant to scare you out of using cudaHostAlloc(), but you should remain aware of the implications of page-locking buffers. We suggest trying to restrict their use to memory that will be used as a source or destination in calls to cudaMemcpy() and freeing them when they are no longer needed rather than waiting until application shutdown to release the memory. The use of cudaHostAlloc() should be no more difficult than anything else you’ve studied so far, but let’s take a look at an example that will both illustrate how pinned memory is allocated and demonstrate its performance advantage over standard pageable memory.
Our application will be very simple and serves primarily to benchmark cudaMemcpy() performance with both pageable and page-locked memory. All we endeavor to do is allocate a GPU buffer and a host buffer of matching sizes and then execute some number of copies between these two buffers. We’ll allow the user of this benchmark to specify the direction of the copy, either “up” (from host to device) or “down” (from device to host). You will also notice that, in order to obtain accurate timings, we set up CUDA events for the start and stop of the sequence of copies. You probably remember how to do this from previous performance-testing examples, but in case you’ve forgotten, the following will jog your memory:
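The scaffolding is the standard CUDA event-timing pattern; a minimal sketch (error checking omitted for brevity) looks like this:

```cuda
cudaEvent_t start, stop;
float elapsedTime;

// Create the events and record the start of the timed region
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// ... the sequence of copies being benchmarked goes here ...

// Record the stop event, wait for it, and read back the elapsed time
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
```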
Independent of the direction of the copies, we start by allocating a host and GPU buffer of size integers. After this, we do 100 copies in the direction specified by the argument up, stopping the timer after we’ve finished copying.
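A sketch of the pageable-memory test, assuming a helper structured like the cuda_malloc_test() function referred to later (the names size and up mirror the text; error checking omitted):

```cuda
float cuda_malloc_test(int size, bool up) {
    int *a, *dev_a;
    float elapsedTime;
    cudaEvent_t start, stop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Pageable host buffer and matching device buffer of 'size' integers
    a = (int*)malloc(size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));

    // Time 100 copies in the direction specified by 'up'
    cudaEventRecord(start, 0);
    for (int i = 0; i < 100; i++) {
        if (up)
            cudaMemcpy(dev_a, a, size * sizeof(int),
                       cudaMemcpyHostToDevice);
        else
            cudaMemcpy(a, dev_a, size * sizeof(int),
                       cudaMemcpyDeviceToHost);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    // Free the buffers and destroy the timing events
    free(a);
    cudaFree(dev_a);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return elapsedTime;
}
```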
After the 100 copies, clean up by freeing the host and GPU buffers as well as destroying our timing events.
If you didn’t notice, the function cuda_malloc_test() allocated pageable host memory with the standard C malloc() routine. The pinned memory version uses cudaHostAlloc() to allocate a page-locked buffer.
As you can see, the buffer allocated by cudaHostAlloc() is used in the same way as a buffer allocated by malloc(). The other change from using malloc() lies in the last argument, the value cudaHostAllocDefault. This last argument stores a collection of flags that we can use to modify the behavior of cudaHostAlloc() in order to allocate other varieties of pinned host memory. In the next chapter, we’ll see how to use the other possible values of these flags, but for now we’re content to use the default, page-locked memory so we pass cudaHostAllocDefault in order to get the default behavior. To free a buffer that was allocated with cudaHostAlloc(), we have to use cudaFreeHost(). That is, every malloc() needs a free(), and every cudaHostAlloc() needs a cudaFreeHost().
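A sketch of the lines in question, assuming the same buffer name a and a size in integers:

```cuda
int *a;

// Allocate 'size' integers of page-locked host memory, default behavior
cudaHostAlloc((void**)&a, size * sizeof(int), cudaHostAllocDefault);

// ... use 'a' exactly as you would a malloc()'d buffer ...

// Page-locked allocations must be released with cudaFreeHost(), not free()
cudaFreeHost(a);
```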
The body of main() proceeds not unlike what you would expect.
Because the up argument to cuda_malloc_test() is true, the previous call tests the performance of copies from host to device, or “up” to the device. To benchmark the calls in the opposite direction, we execute the same calls but with false as the second argument.
We perform the same set of steps to test the performance of cudaHostAlloc(). We call cuda_host_alloc_test() twice, once with up as true and once as false.
On a GeForce GTX 285, we observed copies from host to device improving from 2.77GB/s to 5.11GB/s when we use pinned memory instead of pageable memory. Copies from the device down to the host improve similarly, from 2.43GB/s to 5.46GB/s. So, for most PCIE bandwidth-limited applications, you will notice a marked improvement when using pinned memory versus standard pageable memory. But page-locked memory is not solely for performance enhancements. As we’ll see in the next sections, there are situations where we are required to use page-locked memory.
In Chapter 6, we introduced the concept of CUDA events. In doing so, we postponed an in-depth discussion of the second argument to cudaEventRecord(), instead mentioning only that it specified the stream into which we were inserting the event.
CUDA streams can play an important role in accelerating your applications. A CUDA stream represents a queue of GPU operations that get executed in a specific order. We can add operations such as kernel launches, memory copies, and event starts and stops into a stream. The order in which operations are added to the stream specifies the order in which they will be executed. You can think of each stream as a task on the GPU, and there are opportunities for these tasks to execute in parallel. We’ll first see how streams are used, and then we’ll look at how you can use streams to accelerate your applications.
As we’ll see later, the real power of streams becomes apparent only when we use more than one of them, but we’ll begin to illustrate the mechanics of their use within an application that employs just a single stream. Imagine that we have a CUDA C kernel that will take two input buffers of data, a and b. The kernel will compute some result based on a combination of values in these buffers to produce an output buffer c. Our vector addition example did something along these lines, but in this example we’ll compute an average of three values in a and three values in b:
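A sketch of such a kernel, assuming a chunk size N and one thread per output element (the exact arithmetic matters far less than the structure):

```cuda
#define N (1024*1024)

__global__ void kernel(int *a, int *b, int *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        // Average three neighboring values from each input buffer,
        // then average the two results into the output
        int idx1 = (idx + 1) % 256;
        int idx2 = (idx + 2) % 256;
        float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
        float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
        c[idx] = (as + bs) / 2;
    }
}
```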
This kernel is not incredibly important, so don’t get too hung up on it if you aren’t sure exactly what it’s supposed to be computing. It’s something of a placeholder since the important, stream-related component of this example resides in main().
The first thing we do is choose a device and check to see whether it supports a feature known as device overlap. A GPU supporting device overlap possesses the capacity to simultaneously execute a CUDA C kernel while performing a copy between device and host memory. As we’ve promised before, we’ll use multiple streams to achieve this overlap of computation and data transfer, but first we’ll see how to create and use a single stream. As with all of our examples that aim to measure performance improvements (or regressions), we begin by creating and starting an event timer:
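The overlap check is a matter of querying the device properties; a sketch:

```cuda
cudaDeviceProp prop;
int whichDevice;

cudaGetDevice(&whichDevice);
cudaGetDeviceProperties(&prop, whichDevice);

// deviceOverlap indicates the GPU can copy memory while running a kernel
if (!prop.deviceOverlap) {
    printf("Device will not handle overlaps, "
           "so no speedup from streams\n");
    return 0;
}
```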
After starting our timer, we create the stream we want to use for this application:
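The creation itself is just a declaration and a single runtime call:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);   // initialize the stream
```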
Yeah, that’s pretty much all it takes to create a stream. It’s not really worth dwelling on, so let’s press on to the data allocation.
We have allocated our input and output buffers on both the GPU and the host. Notice that we’ve decided to use pinned memory on the host by using cudaHostAlloc() to perform the allocations. There is a very good reason for using pinned memory, and it’s not strictly because it makes copies faster. We’ll see in detail momentarily, but we will be using a new kind of cudaMemcpy() function, and this new function requires that the host memory be page-locked. After allocating the input buffers, we fill the host allocations with random integers using the C library call rand().
With our stream and our timing events created and our device and host buffers allocated, we’re ready to perform some computations! Typically we blast through this stage by copying the two input buffers to the GPU, launching our kernel, and copying the output buffer back to the host. We will follow this pattern again, but this time with some small changes.
First, we will opt not to copy the input buffers in their entirety to the GPU. Rather, we will split our inputs into smaller chunks and perform the three-step process on each chunk. That is, we will take some fraction of the input buffers, copy them to the GPU, execute our kernel on that fraction of the buffers, and copy the resulting fraction of the output buffer back to the host. Imagine that we need to do this because our GPU has much less memory than our host does, so the computation needs to be staged in chunks because the entire buffer can’t fit on the GPU at once. The code to perform this “chunkified” sequence of computations will look like this:
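A sketch of the chunked loop, assuming a total input size FULL_DATA_SIZE that is a multiple of the chunk size N, device buffers dev_a, dev_b, dev_c of N integers each, and host buffers host_a, host_b, host_c allocated with cudaHostAlloc():

```cuda
for (int i = 0; i < FULL_DATA_SIZE; i += N) {
    // Enqueue copies of one chunk of each input to the device
    cudaMemcpyAsync(dev_a, host_a + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dev_b, host_b + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);

    // Enqueue the kernel on that chunk
    kernel<<<N/256, 256, 0, stream>>>(dev_a, dev_b, dev_c);

    // Enqueue the copy of the resulting chunk back to the host
    cudaMemcpyAsync(host_c + i, dev_c, N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
}
```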
But you will notice two other unexpected shifts from the norm in the preceding excerpt. First, instead of using the familiar cudaMemcpy(), we’re copying the data to and from the GPU with a new routine, cudaMemcpyAsync(). The difference between these functions is subtle yet significant. The original cudaMemcpy() behaves like the C library function memcpy(). Specifically, this function executes synchronously, meaning that when the function returns, the copy has completed, and the output buffer now contains the contents that were supposed to be copied into it.
The opposite of a synchronous function is an asynchronous function, which inspired the name cudaMemcpyAsync(). The call to cudaMemcpyAsync() simply places a request to perform a memory copy into the stream specified by the argument stream. When the call returns, there is no guarantee that the copy has even started yet, much less that it has finished. The guarantee that we have is that the copy will definitely be performed before the next operation placed into the same stream. It is required that any host memory pointers passed to cudaMemcpyAsync() have been allocated by cudaHostAlloc(). That is, you are only allowed to schedule asynchronous copies to or from page-locked memory.
Notice that the angle-bracketed kernel launch also takes an optional stream argument. This kernel launch is asynchronous, just like the preceding two memory copies to the GPU and the trailing memory copy back from the GPU. Technically, we can end an iteration of this loop without having actually started any of the memory copies or kernel execution. As we mentioned, all that we are guaranteed is that the first copy placed into the stream will execute before the second copy. Moreover, the second copy will complete before the kernel starts, and the kernel will complete before the third copy starts. So as we’ve mentioned earlier in this chapter, a stream acts just like an ordered queue of work for the GPU to perform.
When the for() loop has terminated, there could still be quite a bit of work queued up for the GPU to finish. If we would like to guarantee that the GPU is done with its computations and memory copies, we need to synchronize it with the host. That is, we basically want to tell the host to sit around and wait for the GPU to finish before proceeding. We accomplish that by calling cudaStreamSynchronize() and specifying the stream that we want to wait for:
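The synchronization is a single call:

```cuda
cudaStreamSynchronize(stream);   // wait until all work in 'stream' is done
```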
Since the computations and copies have completed after synchronizing stream with the host, we can stop our timer, collect our performance data, and free our input and output buffers.
Finally, before exiting the application, we destroy the stream that we were using to queue the GPU operations.
To be honest, this example has done very little to demonstrate the power of streams. Of course, even using a single stream can help speed up an application if we have work we want to complete on the host while the GPU is busy churning through the work we’ve stuffed into a stream. But assuming that we don’t have much to do on the host, we can still speed up applications by using streams, and in the next section we’ll take a look at how this can be accomplished.
Let’s adapt the single-stream example from Section 10.4: Using a Single CUDA Stream to perform its work in two different streams. At the beginning of the previous example, we checked that the device indeed supported overlap and broke the computation into chunks. The idea underlying the improved version of this application is simple and relies on two things: the “chunked” computation and the overlap of memory copies with kernel execution. We endeavor to get stream 1 to copy its input buffers to the GPU while stream 0 is executing its kernel. Then stream 1 will execute its kernel while stream 0 copies its results to the host. Stream 1 will then copy its results to the host while stream 0 begins executing its kernel on the next chunk of data. Assuming that our memory copies and kernel executions take roughly the same amount of time, our application’s execution timeline might look something like Figure 10.1. The figure assumes that the GPU can perform a memory copy and a kernel execution at the same time, so empty boxes represent time when one stream is waiting to execute an operation that it cannot overlap with the other stream’s operation. Note also that calls to cudaMemcpyAsync() are abbreviated in the remaining figures in this chapter, represented simply as “memcpy.”
Figure 10.1 Timeline of intended application execution using two independent streams
In fact, the execution timeline can be even more favorable than this; some newer NVIDIA GPUs support simultaneous kernel execution and two memory copies, one to the device and one from the device. But on any device that supports the overlap of memory copies and kernel execution, the overall application should accelerate when we use multiple streams.
Despite these grand plans to accelerate our application, the computation kernel will remain unchanged.
As with the single stream version, we will check that the device supports overlapping computation with memory copy. If the device does support overlap, we proceed as we did before by creating CUDA events to time the application.
Next, we create our two streams exactly as we created the single stream in the previous section’s version of the code.
We will assume that we still have two input buffers and a single output buffer on the host. The input buffers are filled with random data exactly as they were in the single-stream version of this application. However, now that we intend to use two streams to process the data, we allocate two identical sets of GPU buffers so that each stream can independently work on chunks of the input.
We then loop over the chunks of input exactly as we did in the first attempt at this application. But now that we’re using two streams, we process twice as much data in each iteration of the for() loop. In stream0, we queue asynchronous copies of a and b to the GPU, queue a kernel execution, and then queue a copy back to c:
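Using two sets of device buffers (suffixed 0 and 1 for stream0 and stream1, with the other names assumed from the single-stream sketch), the depth-first loop body looks roughly like this:

```cuda
for (int i = 0; i < FULL_DATA_SIZE; i += N * 2) {
    // stream0 gets all four operations for its chunk first...
    cudaMemcpyAsync(dev_a0, host_a + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dev_b0, host_b + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    kernel<<<N/256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
    cudaMemcpyAsync(host_c + i, dev_c0, N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream0);

    // ...then stream1 gets the identical sequence for the next chunk
    cudaMemcpyAsync(dev_a1, host_a + i + N, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_b1, host_b + i + N, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);
    kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
    cudaMemcpyAsync(host_c + i + N, dev_c1, N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream1);
}
```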
After queuing these operations in stream0, we queue identical operations on the next chunk of data, but this time in stream1.
And so our for() loop proceeds, alternating the streams to which it queues each chunk of data until it has queued every piece of input data for processing. After terminating the for() loop, we synchronize the GPU with the CPU before we stop our application timers. Since we are working in two streams, we need to synchronize both.
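Synchronizing both streams is just two calls:

```cuda
cudaStreamSynchronize(stream0);   // wait for stream0 to drain
cudaStreamSynchronize(stream1);   // wait for stream1 to drain
```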
We wrap up main() the same way we concluded our single-stream implementation. We stop our timers, display the elapsed time, and clean up after ourselves. Of course, we remember that we now need to destroy two streams and free twice as many GPU buffers, but aside from that, this code is identical to what we’ve seen already:
We benchmarked both the original, single-stream implementation from Section 10.4: Using a Single CUDA Stream and the improved double-stream version on a GeForce GTX 285. The original version takes 62ms to run to completion. After modifying it to use two streams, it takes 61ms.
Uh-oh.
Well, the good news is that this is the reason we bother to time our applications. Sometimes, our most well-intended performance “enhancements” do nothing more than introduce unnecessary complications to the code.
But why didn’t this application get any faster? We even said that it would get faster! Don’t lose hope yet, though, because we actually can accelerate the single-stream version with a second stream, but we need to understand a bit more about how streams are handled by the CUDA driver in order to reap the rewards of device overlap. To understand how streams work behind the scenes, we’ll need to look at both the CUDA driver and how the CUDA hardware architecture works.
Although streams are logically independent queues of operations to be executed on the GPU, it turns out that this abstraction does not exactly match the GPU’s queuing mechanism. As programmers, we think about our streams as ordered sequences of operations composed of a mixture of memory copies and kernel invocations. However, the hardware has no notion of streams. Rather, it has one or more engines to perform memory copies and an engine to execute kernels. These engines queue commands independently from each other, resulting in a task-scheduling scenario like the one shown in Figure 10.2. The arrows in the figure illustrate how operations that have been queued into streams get scheduled on the hardware engines that actually execute them.
So, the user and the hardware have somewhat orthogonal notions of how to queue GPU work, and the burden of keeping both the user and hardware sides of this equation happy falls on the CUDA driver. First and foremost, there are important dependencies specified by the order in which operations are added to streams. For example, in Figure 10.2, stream 0’s memory copy of A needs to be completed before its memory copy of B, which in turn needs to be completed before kernel A is launched. But once these operations are placed into the hardware’s copy engine and kernel engine queues, these dependencies are lost, so the CUDA driver needs to keep everyone happy by ensuring that the intrastream dependencies remain satisfied by the hardware’s execution units.
Figure 10.2 Mapping of CUDA streams onto GPU engines
What does this mean to us? Well, let’s look at what’s actually happening with our example in Section 10.5: Using Multiple CUDA Streams. If we review the code, we see that our application basically amounts to a cudaMemcpyAsync() of a, cudaMemcpyAsync() of b, our kernel execution, and then a cudaMemcpyAsync() of c back to the host. The application enqueues all the operations from stream 0 followed by all the operations from stream 1. The CUDA driver schedules these operations on the hardware for us in the order they were specified, keeping the interengine dependencies straight. These dependencies are illustrated in Figure 10.3 where an arrow from a copy to a kernel indicates that the copy depends on the kernel completing execution before it can begin.
Figure 10.3 Arrows depicting the dependency of cudaMemcpyAsync() calls on kernel executions in the example from Section 10.5: Using Multiple CUDA Streams
Given our newfound understanding of how the GPU schedules work, we can look at a timeline of how these get executed on the hardware in Figure 10.4.
Figure 10.4 Execution timeline of the example from Section 10.5: Using Multiple CUDA Streams
Because stream 0’s copy of c back to the host depends on its kernel execution completing, stream 1’s completely independent copies of a and b to the GPU get blocked because the GPU’s engines execute work in the order it’s provided. This inefficiency explains why the two-stream version of our application showed absolutely no speedup. The lack of improvement is a direct result of our assumption that the hardware works in the same manner as the CUDA stream programming model implies.
The moral of this story is that we as programmers need to help out when it comes to ensuring that independent streams actually get executed in parallel. Keeping in mind that the hardware has independent engines that handle memory copies and kernel executions, we need to remain aware that the order in which we enqueue these operations in our streams will affect the way in which the CUDA driver schedules these for execution. In the next section, we’ll see how to help the hardware achieve overlap of memory copies and kernel execution.
As we saw in the previous section, if we schedule all of a particular stream’s operations at once, it’s very easy to inadvertently block the copies or kernel executions of another stream. To alleviate this problem, it suffices to enqueue our operations breadth-first across streams rather than depth-first. That is, rather than add the copy of a, copy of b, kernel execution, and copy of c to stream 0 before starting to schedule on stream 1, we bounce back and forth between the streams assigning work. We add the copy of a to stream 0, and then we add the copy of a to stream 1. Then we add the copy of b to stream 0, and then we add the copy of b to stream 1. We enqueue the kernel invocation in stream 0, and then we enqueue one in stream 1. Finally, we enqueue the copy of c back to the host in stream 0 followed by the copy of c in stream 1.
To make this more concrete, let’s take a look at the code. All we’ve changed is the order in which operations get assigned to each of our two streams, so this will be strictly a copy-and-paste optimization. Everything else in the application will remain unchanged, which means that our improvements are localized to the for() loop. The new, breadth-first assignment to the two streams looks like this:
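Under the same naming assumptions as the depth-first sketch, the interleaved loop becomes:

```cuda
for (int i = 0; i < FULL_DATA_SIZE; i += N * 2) {
    // Interleave the copies of a across both streams
    cudaMemcpyAsync(dev_a0, host_a + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dev_a1, host_a + i + N, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);

    // Then the copies of b
    cudaMemcpyAsync(dev_b0, host_b + i, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dev_b1, host_b + i + N, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);

    // Then both kernel launches
    kernel<<<N/256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
    kernel<<<N/256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);

    // Finally, both copies of c back to the host
    cudaMemcpyAsync(host_c + i, dev_c0, N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(host_c + i + N, dev_c1, N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream1);
}
```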
If we assume that our memory copies and kernel executions are roughly comparable in execution time, our new execution timeline will look like Figure 10.5. The interengine dependencies are highlighted with arrows simply to illustrate that they are still satisfied with this new scheduling order.
Figure 10.5 Execution timeline of the improved example with arrows indicating interengine dependencies
Because we have queued our operations breadth-first across streams, we no longer have stream 0’s copy of c blocking stream 1’s initial memory copies of a and b. This allows the GPU to execute copies and kernels in parallel, allowing our application to run significantly faster. The new code runs in 48ms, a 21 percent improvement over our original, naïve double-stream implementation. For applications that can overlap nearly all computation and memory copies, you can approach a nearly twofold improvement in performance because the copy and kernel engines will be cranking the entire time.
In this chapter, we looked at a method for achieving a kind of task-level parallelism in CUDA C applications. By using two (or more) CUDA streams, we can allow the GPU to simultaneously execute a kernel while performing a copy between the host and GPU. We need to be careful about two things when we endeavor to do this, though. First, the host memory involved needs to be allocated using cudaHostAlloc() since we will queue our memory copies with cudaMemcpyAsync(), and asynchronous copies need to be performed with pinned buffers. Second, we need to be aware that the order in which we add operations to our streams will affect our capacity to achieve overlapping of copies and kernel executions. The general guideline involves a breadth-first, or round-robin, assignment of work to the streams you intend to use. This can be counterintuitive if you don’t understand how the hardware queuing works, so it’s a good thing to remember when you go about writing your own applications.
There is an old saying that goes something like this: “The only thing better than computing on a GPU is computing on two GPUs.” Systems containing multiple graphics processors have become more and more common in recent years. Of course, in some ways multi-GPU systems are similar to multi-CPU systems in that they are still far from the common system configuration, but it has gotten quite easy to end up with more than one GPU in your system. Products such as the GeForce GTX 295 contain two GPUs on a single card. NVIDIA’s Tesla S1070 contains a whopping four CUDA-capable graphics processors in it. Systems built around a recent NVIDIA chipset will have an integrated, CUDA-capable GPU on the motherboard. Adding a discrete NVIDIA GPU in one of the PCI Express slots will make this system multi-GPU. Neither of these scenarios is very farfetched, so we would be best served by learning to exploit the resources of a system with multiple GPUs in it.
Through the course of this chapter, you will accomplish the following:
• You will learn how to allocate and use zero-copy memory.
• You will learn how to use multiple GPUs within the same application.
• You will learn how to allocate and use portable pinned memory.
In Chapter 10, we examined pinned or page-locked memory, a new type of host memory that came with the guarantee that the buffer would never be swapped out of physical memory. If you recall, we allocated this memory by making a call to cudaHostAlloc() and passing cudaHostAllocDefault to get default, pinned memory. We promised that in the next chapter, you would see other more exciting means by which you can allocate pinned memory. Assuming that this is the only reason you’ve continued reading, you will be glad to know that the wait is over. The flag cudaHostAllocMapped can be passed instead of cudaHostAllocDefault. The host memory allocated using cudaHostAllocMapped is pinned in the same sense that memory allocated with cudaHostAllocDefault is pinned, specifically that it cannot be paged out of or relocated within physical memory. But in addition to using this memory from the host for memory copies to and from the GPU, this new kind of host memory allows us to violate one of the first rules we presented in Chapter 3 concerning host memory: We can access this host memory directly from within CUDA C kernels. Because this memory does not require copies to and from the GPU, we refer to it as zero-copy memory.
Typically, our GPU accesses only GPU memory, and our CPU accesses only host memory. But in some circumstances, it’s better to break these rules. To see an instance where it’s better to have the GPU manipulate host memory, we’ll revisit our favorite reduction: the vector dot product. If you’ve managed to read this entire book, you may recall our first attempt at the dot product. We copied the two input vectors to the GPU, performed the computation, copied the intermediate results back to the host, and completed the computation on the CPU.
In this version, we’ll skip the explicit copies of our input up to the GPU and instead use zero-copy memory to access the data directly from the GPU. This version of dot product will be set up exactly like our pinned memory test. Specifically, we’ll write two functions; one will perform the test with standard host memory, and the other will finish the reduction on the GPU using zero-copy memory to hold the input and output buffers. First let’s take a look at the standard host memory version of the dot product. We start in the usual fashion by creating timing events, allocating input and output buffers, and filling our input buffers with data.
After the allocations and data creation, we can begin the computations. We start our timer, copy our inputs to the GPU, execute the dot product kernel, and copy the partial results back to the host.
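The timed section might look like the sketch below. The kernel name `dot`, the launch configuration, and the buffer names are assumptions consistent with the Chapter 5 version of this example:

```cuda
// Time the host-to-device copies, the kernel, and the copy of the
// partial sums back to the host.
cudaEventRecord(start, 0);

cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);

dot<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b,
                                        dev_partial_c);

cudaMemcpy(partial_c, dev_partial_c,
           blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
```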
Now we need to finish up our computations on the CPU as we did in Chapter 5. Before doing this, we’ll stop our event timer because it only measures work that’s being performed on the GPU:
Finally, we sum our partial results and free our input and output buffers.
The version that uses zero-copy memory will be remarkably similar, with the exception of memory allocation. So, we start by allocating our input and output, filling the input memory with data as before:
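The allocations might look like this sketch, where the buffer names and sizes follow the standard host memory version described above:

```cuda
// Zero-copy allocations: cudaHostAllocMapped makes the buffers
// accessible from the GPU; cudaHostAllocWriteCombined is added for
// the inputs, which the CPU only writes and the GPU only reads.
float *a, *b, *partial_c;
cudaHostAlloc((void**)&a, size * sizeof(float),
              cudaHostAllocWriteCombined | cudaHostAllocMapped);
cudaHostAlloc((void**)&b, size * sizeof(float),
              cudaHostAllocWriteCombined | cudaHostAllocMapped);
cudaHostAlloc((void**)&partial_c, blocksPerGrid * sizeof(float),
              cudaHostAllocMapped);

// fill the inputs with data as before
for (int i = 0; i < size; i++) {
    a[i] = i;
    b[i] = i * 2;
}
```

Note that `partial_c` is *not* write-combined, since the CPU will read it when summing the partial results.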
As with Chapter 10, we see cudaHostAlloc() in action again, although we’re now using the flags argument to specify more than just default behavior. The flag cudaHostAllocMapped tells the runtime that we intend to access this buffer from the GPU. In other words, this flag is what makes our buffer zero-copy. For the two input buffers, we specify the flag cudaHostAllocWriteCombined. This flag indicates that the runtime should allocate the buffer as write-combined with respect to the CPU cache. This flag will not change functionality in our application but represents an important performance enhancement for buffers that will be read only by the GPU. However, write-combined memory can be extremely inefficient in scenarios where the CPU also needs to perform reads from the buffer, so you will have to consider your application’s likely access patterns when making this decision.
Since we’ve allocated our host memory with the flag cudaHostAllocMapped, the buffers can be accessed from the GPU. However, the GPU has a different virtual memory space than the CPU, so the buffers will have different addresses when they’re accessed on the GPU as compared to the CPU. The call to cudaHostAlloc() returns the CPU pointer for the memory, so we need to call cudaHostGetDevicePointer() in order to get a valid GPU pointer for the memory. These pointers will be passed to the kernel and then used by the GPU to read from and write to our host allocations:
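As a sketch (the `dev_*` pointer names are assumptions mirroring the rest of the example):

```cuda
// Ask the runtime for the GPU-side aliases of our zero-copy buffers.
// The final argument is a flags parameter that must currently be 0.
float *dev_a, *dev_b, *dev_partial_c;
cudaHostGetDevicePointer((void**)&dev_a, a, 0);
cudaHostGetDevicePointer((void**)&dev_b, b, 0);
cudaHostGetDevicePointer((void**)&dev_partial_c, partial_c, 0);
```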
With valid device pointers in hand, we’re ready to start our timer and launch our kernel.
Even though the pointers dev_a, dev_b, and dev_partial_c all reside on the host, they will look to our kernel as if they are GPU memory, thanks to our calls to cudaHostGetDevicePointer(). Since our partial results are already on the host, we don’t need to bother with a cudaMemcpy() from the device. However, you will notice that we’re synchronizing the CPU with the GPU by calling cudaThreadSynchronize(). The contents of zero-copy memory are undefined during the execution of a kernel that potentially makes changes to its contents. After synchronizing, we’re sure that the kernel has completed and that our zero-copy buffer contains the results so we can stop our timer and finish the computation on the CPU as we did before.
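Putting the launch and the synchronization together, a sketch of this section (launch configuration assumed from Chapter 5; `cudaThreadSynchronize()` is the API of this era of CUDA):

```cuda
cudaEventRecord(start, 0);

// the kernel reads and writes host memory directly through the
// device pointers obtained from cudaHostGetDevicePointer()
dot<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b,
                                        dev_partial_c);

// the contents of partial_c are undefined until the kernel finishes,
// so synchronize before reading them on the CPU
cudaThreadSynchronize();

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
```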
The only thing remaining in the cudaHostAlloc() version of the dot product is cleanup.
You will notice that no matter what flags we use with cudaHostAlloc(), the memory always gets freed in the same way. Specifically, a call to cudaFreeHost() does the trick.
And that’s that! All that remains is to look at how main() ties all of this together. The first thing we need to check is whether our device supports mapping host memory. We do this the same way we checked for device overlap in the previous chapter, with a call to cudaGetDeviceProperties().
Assuming that our device supports zero-copy memory, we place the runtime into a state where it will be able to allocate zero-copy buffers for us. We accomplish this by a call to cudaSetDeviceFlags() and by passing the flag cudaDeviceMapHost to indicate that we want the device to be allowed to map host memory:
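A sketch combining the capability check with the flag (the early-exit message is an assumption):

```cuda
cudaDeviceProp prop;
int whichDevice;
cudaGetDevice(&whichDevice);
cudaGetDeviceProperties(&prop, whichDevice);
if (prop.canMapHostMemory != 1) {
    printf("Device cannot map host memory.\n");
    return 0;
}

// allow the runtime to allocate mapped (zero-copy) host buffers
cudaSetDeviceFlags(cudaDeviceMapHost);
```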
That’s really all there is to main(). We run our two tests, display the elapsed time, and exit the application:
The kernel itself is unchanged from Chapter 5, but for the sake of completeness, here it is in its entirety:
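The original listing was lost in this copy of the text; the sketch below reconstructs the shared-memory reduction kernel from the description in Chapter 5, so treat the details (names, the `imin` helper, the block count cap of 32) as an approximation rather than the verbatim listing:

```cuda
#define imin(a,b) ((a)<(b)?(a):(b))

const int threadsPerBlock = 256;
const int blocksPerGrid =
    imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);

__global__ void dot(int size, float *a, float *b, float *c) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    // each thread accumulates a grid-strided partial sum
    float temp = 0;
    while (tid < size) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();

    // tree reduction in shared memory;
    // threadsPerBlock must be a power of two
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];   // one partial result per block
}
```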
What should we expect to gain from using zero-copy memory? The answer to this question is different for discrete GPUs and integrated GPUs. Discrete GPUs are graphics processors that have their own dedicated DRAMs and typically sit on separate circuit boards from the CPU. For example, if you have ever installed a graphics card into your desktop, this GPU is a discrete GPU. Integrated GPUs are graphics processors built into a system’s chipset and usually share regular system memory with the CPU. Many recent systems built with NVIDIA’s nForce media and communications processors (MCPs) contain CUDA-capable integrated GPUs. In addition to nForce MCPs, all the netbook, notebook, and desktop computers based on NVIDIA’s new ION platform contain integrated, CUDA-capable GPUs. For integrated GPUs, the use of zero-copy memory is always a performance win because the memory is physically shared with the host anyway. Declaring a buffer as zero-copy has the sole effect of preventing unnecessary copies of data. But remember that nothing is free and that zero-copy buffers are still constrained in the same way that all pinned memory allocations are constrained: Each pinned allocation carves into the system’s available physical memory, which will eventually degrade system performance.
In cases where inputs and outputs are used exactly once, we will even see a performance enhancement when using zero-copy memory with a discrete GPU. Since GPUs are designed to excel at hiding the latencies associated with memory access, performing reads and writes over the PCI Express bus can be mitigated to some degree by this mechanism, yielding a noticeable performance advantage. But since the zero-copy memory is not cached on the GPU, in situations where the memory gets read multiple times, we will end up paying a large penalty that could be avoided by simply copying the data to the GPU first.
How do you determine whether a GPU is integrated or discrete? Well, you can open up your computer and look, but this solution is fairly unworkable for your CUDA C application. Your code can check this property of a GPU by, not surprisingly, looking at the structure returned by cudaGetDeviceProperties(). This structure has a field named integrated, which will be true if the device is an integrated GPU and false if it’s not.
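For example, a minimal check might look like this (the printed messages are assumptions):

```cuda
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.integrated)
    printf("Device 0 is an integrated GPU\n");
else
    printf("Device 0 is a discrete GPU\n");
```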
Since our dot product application satisfies the “read and/or write exactly once” constraint, it’s possible that it will enjoy a performance boost when run with zero-copy memory. And in fact, it does enjoy a slight boost in performance. On a GeForce GTX 285, the execution time improves by more than 45 percent, dropping from 98.1ms to 52.1ms when migrated to zero-copy memory. A GeForce GTX 280 enjoys a similar improvement, speeding up by 34 percent from 143.9ms to 94.7ms. Of course, different GPUs will exhibit different performance characteristics because of varying ratios of computation to bandwidth, as well as because of variations in effective PCI Express bandwidth across chipsets.
In the previous section, we mentioned how devices are either integrated or discrete GPUs, where the former is built into the system’s chipset and the latter is typically an expansion card in a PCI Express slot. More and more systems contain both integrated and discrete GPUs, meaning that they also have multiple CUDA-capable processors. NVIDIA also sells products, such as the GeForce GTX 295, that contain more than one GPU. A GeForce GTX 295, while physically occupying a single expansion slot, will appear to your CUDA applications as two separate GPUs. Furthermore, users can also add multiple GPUs to separate PCI Express slots, connecting them with bridges using NVIDIA’s scalable link interface (SLI) technology. As a result of these trends, it has become relatively common to have a CUDA application running on a system with multiple graphics processors. Since our CUDA applications tend to be very parallelizable to begin with, it would be excellent if we could use every CUDA device in the system to achieve maximum throughput. So, let’s figure out how we can accomplish this.
To avoid learning a new example, let’s convert our dot product to use multiple GPUs. To make our lives easier, we will summarize all the data necessary to compute a dot product in a single structure. You’ll see momentarily exactly why this will make our lives easier.
This structure contains the identification for the device on which the dot product will be computed; it contains the size of the input buffers as well as pointers to the two inputs a and b. Finally, it has an entry to store the value computed as the dot product of a and b.
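A sketch of the structure matching that description:

```cuda
struct DataStruct {
    int    deviceID;      // which GPU computes this half
    int    size;          // number of elements in each input
    float  *a;            // first input vector
    float  *b;            // second input vector
    float  returnValue;   // dot product of a and b, filled in by the thread
};
```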
To use N GPUs, we first would like to know exactly what value of N we’re dealing with. So, we start our application with a call to cudaGetDeviceCount() in order to determine how many CUDA-capable processors have been installed in our system.
This example is designed to show multi-GPU usage, so you’ll notice that we simply exit if the system has only one CUDA device (not that there’s anything wrong with that). This is not encouraged as a best practice for obvious reasons. To keep things as simple as possible, we’ll allocate standard host memory for our inputs and fill them with data exactly how we’ve done in the past.
We’re now ready to dive into the multi-GPU code. The trick to using multiple GPUs with the CUDA runtime API is realizing that each GPU needs to be controlled by a different CPU thread. Since we have used only a single GPU before, we haven’t needed to worry about this. We have moved a lot of the annoyance of multithreaded code to our file of auxiliary code, book.h. With this code tucked away, all we need to do is fill a structure with data necessary to perform the computations. Although the system could have any number of GPUs greater than one, we will use only two of them for clarity:
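Filling the two structures might look like the sketch below, splitting the input vectors in half; `N` and the host arrays `a` and `b` are assumed from the surrounding listing:

```cuda
DataStruct data[2];

// first half of the inputs goes to device 0
data[0].deviceID = 0;
data[0].size     = N / 2;
data[0].a        = a;
data[0].b        = b;

// second half goes to device 1
data[1].deviceID = 1;
data[1].size     = N / 2;
data[1].a        = a + N / 2;
data[1].b        = b + N / 2;
```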
To proceed, we pass one of the DataStruct variables to a utility function we’ve named start_thread(). We also pass start_thread() a pointer to a function to be called by the newly created thread; this example’s thread function is called routine(). The function start_thread() will create a new thread that then calls the specified function, passing the DataStruct to this function. The other call to routine() gets made from the default application thread (so we’ve created only one additional thread).
Before we proceed, we have the main application thread wait for the other thread to finish by calling end_thread().
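Using the helpers from `book.h` (the `CUTThread` handle type is an assumption about that helper code), the thread management amounts to:

```cuda
// spawn one extra thread for device 1's half of the work...
CUTThread thread = start_thread(routine, &(data[0]));

// ...while the main application thread handles the other half
routine(&(data[1]));

// wait for the helper thread to finish before using its result
end_thread(thread);
```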
Since both threads have completed at this point in main(), it’s safe to clean up and display the result.
Notice that we sum the results computed by each thread. This is the last step in our dot product reduction. In another algorithm, this combination of multiple results may involve other steps. In fact, in some applications, the two GPUs may be executing completely different code on completely different data sets. For simplicity’s sake, this is not the case in our dot product example.
Since the dot product routine is identical to the other versions you’ve seen, we’ll omit it from this section. However, the contents of routine() may be of interest. We declare routine() as taking and returning a void* so that you can reuse the start_thread() code with arbitrary implementations of a thread function. Although we’d love to take credit for this idea, it’s fairly standard procedure for callback functions in C:
Each thread calls cudaSetDevice(), and each passes a different ID to this function. As a result, we know each thread will be manipulating a different GPU. These GPUs may have identical performance, as with the dual-GPU GeForce GTX 295, or they may be different GPUs as would be the case in a system that has both an integrated GPU and a discrete GPU. These details are not important to our application, though they might be of interest to you. Particularly, these details prove useful if you depend on a certain minimum compute capability to launch your kernels or if you have a serious desire to load balance your application across the system’s GPUs. If the GPUs are different, you will need to do some work to partition the computations so that each GPU is occupied for roughly the same amount of time. For our purposes in this example, however, these are piddling details with which we won’t worry.
Outside the call to cudaSetDevice() to specify which CUDA device we intend to use, this implementation of routine() is remarkably similar to the vanilla malloc_test() from Section 11.2.1: Zero-Copy Dot Product. We allocate buffers for our GPU copies of the input and a buffer for our partial results followed by a cudaMemcpy() of each input array to the GPU.
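A sketch of `routine()` through cleanup follows; names such as `threadsPerBlock`, `blocksPerGrid`, and the `dot` kernel are carried over from earlier in the chapter, and the listing is an approximation rather than the book's verbatim code:

```cuda
void* routine(void *pvoidData) {
    DataStruct *data = (DataStruct*)pvoidData;
    cudaSetDevice(data->deviceID);   // each thread drives a different GPU

    int   size = data->size;
    float *a = data->a;
    float *b = data->b;
    float *partial_c = (float*)malloc(blocksPerGrid * sizeof(float));

    float *dev_a, *dev_b, *dev_partial_c;
    cudaMalloc((void**)&dev_a, size * sizeof(float));
    cudaMalloc((void**)&dev_b, size * sizeof(float));
    cudaMalloc((void**)&dev_partial_c, blocksPerGrid * sizeof(float));

    // copy each input array up to this thread's GPU
    cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocksPerGrid, threadsPerBlock>>>(size, dev_a, dev_b,
                                            dev_partial_c);

    cudaMemcpy(partial_c, dev_partial_c,
               blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);

    // finish the reduction on the CPU
    float c = 0;
    for (int i = 0; i < blocksPerGrid; i++)
        c += partial_c[i];

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_partial_c);
    free(partial_c);

    data->returnValue = c;
    return 0;
}
```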
We then launch our dot product kernel, copy the results back, and finish the computation on the CPU.
As usual, we clean up our GPU buffers and return the dot product we’ve computed in the returnValue field of our DataStruct.
So when we get down to it, outside of the host thread management issue, using multiple GPUs is not too much tougher than using a single GPU. Using our helper code to create a thread and execute a function on that thread, this becomes significantly more manageable. If you have your own thread libraries, you should feel free to use them in your own applications. You just need to remember that each GPU gets its own thread, and everything else is cream cheese.
The last important piece to using multiple GPUs involves the use of pinned memory. We learned in Chapter 10 that pinned memory is actually host memory that has its pages locked in physical memory to prevent it from being paged out or relocated. However, it turns out that pages can appear pinned to a single CPU thread only. That is, they will remain page-locked if any thread has allocated them as pinned memory, but they will only appear page-locked to the thread that allocated them. If the pointer to this memory is shared between threads, the other threads will see the buffer as standard, pageable data.
As a side effect of this behavior, when a thread that did not allocate a pinned buffer attempts to perform a cudaMemcpy() using it, the copy will be performed at standard pageable memory speeds. As we saw in Chapter 10, this speed can be roughly 50 percent of the maximum attainable transfer speed. What’s worse, if the thread attempts to enqueue a cudaMemcpyAsync() call into a CUDA stream, this operation will fail because it requires a pinned buffer to proceed. Since the buffer appears pageable from the thread that didn’t allocate it, the call dies a grisly death. Even in the future nothing works!
But there is a remedy to this problem. We can allocate pinned memory as portable, meaning that we will be allowed to migrate it between host threads and allow any thread to view it as a pinned buffer. To do so, we use our trusty cudaHostAlloc() to allocate the memory, but we call it with a new flag: cudaHostAllocPortable. This flag can be used in concert with the other flags you’ve seen, such as cudaHostAllocWriteCombined and cudaHostAllocMapped. This means that you can allocate your host buffers as any combination of portable, zero-copy, and write-combined.
To demonstrate portable pinned memory, we’ll enhance our multi-GPU dot product application. We’ll adapt our original zero-copy version of the dot product, so this version begins as something of a mash-up of the zero-copy and multi-GPU versions. As we have throughout this chapter, we need to verify that there are at least two CUDA-capable GPUs and that both can handle zero-copy buffers.
In previous examples, we’d be ready to start allocating memory on the host to hold our input vectors. To allocate portable pinned memory, however, it’s necessary to first set the CUDA device on which we intend to run. Since we intend to use the device for zero-copy memory as well, we follow the cudaSetDevice() call with a call to cudaSetDeviceFlags(), as we did in Section 11.2.1: Zero-Copy Dot Product.
Earlier in this chapter, we called cudaSetDevice() but not until we had already allocated our memory and created our threads. One of the requirements of allocating portable page-locked memory with cudaHostAlloc(), though, is that we have initialized the device first by calling cudaSetDevice(). You will also notice that we pass our newly learned flag, cudaHostAllocPortable, to both allocations. Since these were allocated after calling cudaSetDevice(0), only CUDA device zero would see these buffers as pinned memory if we had not specified that they were to be portable allocations.
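Putting the pieces together, the allocations might look like this sketch (buffer names and `N` assumed from the surrounding example):

```cuda
// initialize device 0 first, and enable mapped (zero-copy) memory
cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

// portable + mapped + write-combined input buffers: pinned for every
// host thread, GPU-accessible, and optimized for CPU writes
float *a, *b;
cudaHostAlloc((void**)&a, N * sizeof(float),
              cudaHostAllocWriteCombined |
              cudaHostAllocPortable      |
              cudaHostAllocMapped);
cudaHostAlloc((void**)&b, N * sizeof(float),
              cudaHostAllocWriteCombined |
              cudaHostAllocPortable      |
              cudaHostAllocMapped);
```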
We continue the application as we have in the past, generating data for our input vectors and preparing our DataStruct structures as we did in the multi-GPU example in Section 11.3: Using Multiple GPUs.
We can then create our secondary thread and call routine() to begin computing on each device.
Because our host memory was allocated by the CUDA runtime, we use cudaFreeHost() to free it. Other than no longer calling free(), we have seen all there is to see in main().
To support portable pinned memory and zero-copy memory in our multi-GPU application, we need to make two notable changes in the code for routine(). The first is a bit subtle, and in no way should this have been obvious.
You may recall in our multi-GPU version of this code, we need a call to cudaSetDevice() in routine() in order to ensure that each participating thread controls a different GPU. On the other hand, in this example we have already made a call to cudaSetDevice() from the main thread. We did so in order to allocate pinned memory in main(). As a result, we only want to call cudaSetDevice() and cudaSetDeviceFlags() on devices where we have not made this call. That is, we call these two functions if the deviceID is not zero. Although it would yield cleaner code to simply repeat these calls on device zero, it turns out that this is in fact an error. Once you have set the device on a particular thread, you cannot call cudaSetDevice() again, even if you pass the same device identifier. The highlighted if() statement helps us avoid this little nasty-gram from the CUDA runtime, so we move on to the next important change to routine().
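The guard in question amounts to a sketch like this inside `routine()`:

```cuda
// device 0 was already selected (and its flags set) on the main
// thread when we allocated portable pinned memory, and cudaSetDevice()
// may not be called twice on the same thread in this CUDA era
if (data->deviceID != 0) {
    cudaSetDevice(data->deviceID);
    cudaSetDeviceFlags(cudaDeviceMapHost);
}
```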
In addition to using portable pinned memory for the host-side memory, we are using zero-copy in order to access these buffers directly from the GPU. Consequently, we no longer use cudaMemcpy() as we did in the original multi-GPU application, but we use cudaHostGetDevicePointer() to get valid device pointers for the host memory as we did in the zero-copy example. However, you will notice that we use standard GPU memory for the partial results. As always, this memory gets allocated using cudaMalloc().
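A sketch of the pointer setup described above (the buffer and field names are illustrative assumptions; error checking is omitted):

```c
// Instead of cudaMemcpy(), map the portable pinned host buffers into the
// device address space and hand those pointers directly to the kernel.
float *dev_a, *dev_b, *dev_partial_c;
cudaHostGetDevicePointer((void**)&dev_a, data->a, 0);
cudaHostGetDevicePointer((void**)&dev_b, data->b, 0);
// The partial results still live in ordinary device memory:
cudaMalloc((void**)&dev_partial_c, blocksPerGrid * sizeof(float));
```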
At this point, we’re pretty much ready to go, so we launch our kernel and copy our results back from the GPU.
We conclude as we always have in our dot product example by summing our partial results on the CPU, freeing our temporary storage, and returning to main().
We have seen some new types of host memory allocations, all of which get allocated with a single call, cudaHostAlloc(). Using a combination of this one entry point and a set of argument flags, we can allocate memory as any combination of zero-copy, portable, and/or write-combined. We used zero-copy buffers to avoid making explicit copies of data to and from the GPU, a maneuver that potentially speeds up a wide class of applications. Using a support library for threading, we manipulated multiple GPUs from the same application, allowing our dot product computation to be performed across multiple devices. Finally, we saw how multiple GPUs could share pinned memory allocations by allocating them as portable pinned memory. Our last example used portable pinned memory, multiple GPUs, and zero-copy buffers in order to demonstrate a turbocharged version of the dot product we started toying with back in Chapter 5. As multiple-device systems gain popularity, these techniques should serve you well in harnessing the computational power of your target platform in its entirety.
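The flag combinations mentioned above can be sketched as follows (the buffer name and size N are illustrative):

```c
// One entry point, three OR-able flags. cudaHostAllocDefault (zero)
// yields ordinary pinned memory; the flags below may be combined freely.
float *buf;
cudaHostAlloc((void**)&buf, N * sizeof(float),
              cudaHostAllocPortable |        // pinned from the view of every host thread
              cudaHostAllocMapped |          // mapped into device space for zero-copy access
              cudaHostAllocWriteCombined);   // fast CPU writes, slow CPU reads
/* ... use the buffer ... */
cudaFreeHost(buf);   // pinned allocations are freed with cudaFreeHost()
```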
Congratulations! We hope you’ve enjoyed learning about CUDA C and experimenting some with GPU computing. It’s been a long trip, so let’s take a moment to review where we started and how much ground we’ve covered. Starting with a background in C or C++ programming, we’ve learned how to use the CUDA runtime’s angle bracket syntax to easily launch multiple copies of kernels across any number of multiprocessors. We expanded these concepts to use collections of threads and blocks, operating on arbitrarily large inputs. These more complex launches exploited interthread communication using the GPU’s special, on-chip shared memory, and they employed dedicated synchronization primitives to ensure correct operation in an environment that supports (and encourages) thousands upon thousands of parallel threads.
Armed with basic concepts about parallel programming using CUDA C on NVIDIA’s CUDA Architecture, we explored some of the more advanced concepts and APIs that NVIDIA provides. The GPU’s dedicated graphics hardware proves useful for GPU computing, so we learned how to exploit texture memory to accelerate some common patterns of memory access. Because many users add GPU computing to their interactive graphics applications, we explored the interoperation of CUDA C kernels with industry-standard graphics APIs such as OpenGL and DirectX. Atomic operations on both global and shared memory allowed safe, multithreaded access to common memory locations. Moving steadily into more and more advanced topics, streams enabled us to keep our entire system as busy as possible, allowing kernels to execute simultaneously with memory copies between the host and GPU. Finally, we looked at the ways in which we could allocate and use zero-copy memory to accelerate applications on integrated GPUs. Moreover, we learned to initialize multiple devices and allocate portable pinned memory in order to write CUDA C that fully utilizes increasingly common, multi-GPU environments.
Through the course of this chapter, you will accomplish the following:
• You will learn about some of the tools available to aid your CUDA C development.
• You will learn about additional written and code resources to take your CUDA C development to the next level.
Through the course of this book, we have relied upon several components of the CUDA C software system. The applications we wrote made heavy use of the CUDA C compiler in order to convert our CUDA C kernels into code that could be executed on NVIDIA GPUs. We also used the CUDA runtime in order to perform much of the setup and dirty work behind launching kernels and communicating with the GPU. The CUDA runtime, in turn, uses the CUDA driver to talk directly to the hardware in your system. In addition to these components that we have already used at length, NVIDIA makes available a host of other software in order to ease the development of CUDA C applications. This section does not serve well as a user’s manual to these products, but rather, it aims solely to inform you of the existence and utility of these packages.
You almost certainly already have the CUDA Toolkit collection of software on your development machine. We can be so sure of this because the set of CUDA C compiler tools comprises one of the principal components of this package. If you don’t have the CUDA Toolkit on your machine, then it’s a veritable certainty that you haven’t tried to write or compile any CUDA C code. We’re on to you now, sucker! Actually, this is no big deal (but it does make us wonder why you’ve read this entire book). On the other hand, if you have been working through the examples in this book, then you should possess the libraries we’re about to discuss.
The CUDA Toolkit comes with two very important utility libraries if you plan to pursue GPU computing in your own applications. First, NVIDIA provides a tuned Fast Fourier Transform library known as CUFFT. As of release 3.0, the CUFFT library supports a number of useful features, including the following:
• One-, two-, and three-dimensional transforms of both real-valued and complex-valued input data
• Batch execution for performing multiple one-dimensional transforms in parallel
• 2D and 3D transforms with sizes ranging from 2 to 16,384 in any dimension
• 1D transforms of inputs up to 8 million elements in size
• In-place and out-of-place transforms for both real-valued and complex-valued data
NVIDIA provides the CUFFT library free of charge with an accompanying license that allows for use in any application, regardless of whether it’s for personal, academic, or professional development.
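The features listed above can be exercised with very little code. A minimal CUFFT sketch (NX and dev_signal are assumed names: a transform size and a device buffer of cufftComplex already populated with input data):

```c
#include <cufft.h>

cufftHandle plan;
cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                       // one 1D complex-to-complex transform
cufftExecC2C(plan, dev_signal, dev_signal, CUFFT_FORWARD);  // in-place forward FFT
cufftDestroy(plan);
```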
In addition to a Fast Fourier Transform library, NVIDIA also provides a library of linear algebra routines that implements the well-known package of Basic Linear Algebra Subprograms (BLAS). This library, named CUBLAS, is also freely available and supports a large subset of the full BLAS package. This includes versions of each routine that accept both single- and double-precision inputs as well as real- and complex-valued data. Because BLAS was originally a FORTRAN-implemented library of linear algebra routines, NVIDIA attempts to maximize compatibility with the requirements and expectations of these implementations. Specifically, the CUBLAS library uses a column-major storage layout for arrays, rather than the row-major layout natively used by C and C++. In practice, this is not typically a concern, but it does allow for current users of BLAS to adapt their applications to exploit the GPU-accelerated CUBLAS with minimal effort. NVIDIA also distributes FORTRAN bindings to CUBLAS in order to demonstrate how to link existing FORTRAN applications to CUDA libraries.
Available separately from the NVIDIA drivers and CUDA Toolkit, the optional GPU Computing SDK download contains a package of dozens and dozens of sample GPU computing applications. We mentioned this SDK earlier in the book because its samples serve as an excellent complement to the material we’ve covered in the first 11 chapters. But if you haven’t taken a look yet, NVIDIA has geared these samples toward varying levels of CUDA C competency as well as spreading them over a broad spectrum of subject material. The samples are roughly categorized into the following sections:
CUDA Basic Topics
CUDA Advanced Topics
CUDA Systems Integration
Data-Parallel Algorithms
Graphics Interoperability
Texture
Performance Strategies
Linear Algebra
Image/Video Processing
Computational Finance
Data Compression
Physically-Based Simulation
The examples work on any platform that CUDA C works on and can serve as excellent jumping-off points for your own applications. For readers who have considerable experience in some of these areas, we warn you against expecting to see state-of-the-art implementations of your favorite algorithms in the NVIDIA GPU Computing SDK. These code samples should not be treated as production-worthy library code but rather as educational illustrations of functioning CUDA C programs, not unlike the examples in this book.
In addition to the routines offered in the CUFFT and CUBLAS libraries, NVIDIA also maintains a library of functions for performing CUDA-accelerated data processing known as the NVIDIA Performance Primitives (NPP). Currently, NPP’s initial set of functionality focuses specifically on imaging and video processing and is widely applicable for developers in these areas. NVIDIA intends for NPP to evolve over time to address a greater number of computing tasks in a wider range of domains. If you have an interest in high-performance imaging or video applications, you should make it a priority to look into NPP, available as a free download at www.nvidia.com/object/npp.html (or accessible from your favorite web search engine).
We have heard from a variety of sources that, in rare instances, computer software does not work exactly as intended when first executed. Some code computes incorrect values, some fails to terminate execution, and some code even puts the computer into a state that only a flip of the power switch can remedy. Although having clearly never written code like this personally, the authors of this book recognize that some software engineers may desire resources to debug their CUDA C kernels. Fortunately, NVIDIA provides tools to make this painful process significantly less troublesome.
A tool known as CUDA-GDB is one of the most useful CUDA downloads available to CUDA C programmers who develop their code on Linux-based systems. NVIDIA extended the open source GNU debugger (gdb) to transparently support debugging device code in real time while maintaining the familiar interface of gdb. Prior to CUDA-GDB, there existed no good way to debug device code outside of using the CPU to simulate the way in which it was expected to run. This method yielded extremely slow debugging, and in fact, it was frequently a very poor approximation of the exact GPU execution of the kernel. NVIDIA’s CUDA-GDB enables programmers to debug their kernels directly on the GPU, affording them all of the control that they’ve grown accustomed to with CPU debuggers. Some of the highlights of CUDA-GDB include the following:
• Viewing CUDA state, such as information regarding installed GPUs and their capabilities
• Setting breakpoints in CUDA C source code
• Inspecting GPU memory, including all global and shared memory
• Inspecting the blocks and threads currently resident on the GPU
• Single-stepping a warp of threads
• Breaking into currently running applications, including hung or deadlocked applications
Along with the debugger, NVIDIA provides the CUDA Memory Checker whose functionality can be accessed through CUDA-GDB or the stand-alone tool, cuda-memcheck. Because the CUDA Architecture includes a sophisticated memory management unit built directly into the hardware, all illegal memory accesses will be detected and prevented by the hardware. As a result of a memory violation, your program will cease functioning as expected, so you will certainly want visibility into these types of errors. When enabled, the CUDA Memory Checker will detect any global memory violations or misaligned global memory accesses that your kernel attempts to make, reporting them to you in a far more helpful and verbose manner than previously possible.
Although CUDA-GDB is a mature and fantastic tool for debugging your CUDA C kernels on hardware in real time, NVIDIA recognizes that not every developer is over the moon about Linux. So, unless Windows users are hedging their bets by saving up to open their own pet stores, they need a way to debug their applications, too. Toward the end of 2009, NVIDIA introduced NVIDIA Parallel Nsight (originally code-named Nexus), the first integrated GPU/CPU debugger for Microsoft Visual Studio. Like CUDA-GDB, Parallel Nsight supports debugging CUDA applications with thousands of threads. Users can place breakpoints anywhere in their CUDA C source code, including breakpoints that trigger on writes to arbitrary memory locations. They can inspect GPU memory directly from the Visual Studio Memory window and check for out-of-bounds memory accesses. This tool has been made publicly available in a beta program as of press time, and the final version should be released shortly.
We often tout the CUDA Architecture as a wonderful foundation for high-performance computing applications. Unfortunately, the reality is that after ferreting out all the bugs from your applications, even the most well-meaning “high-performance computing” applications are more accurately referred to as simply “computing” applications. We have often been in the position where we wonder, “Why in the Sam Hill is my code performing so poorly?” In situations like this, it helps to be able to execute the kernels in question under the watchful gaze of a profiling tool. NVIDIA provides just such a tool, available as a separate download on the CUDA Zone website. Figure 12.1 shows the Visual Profiler being used to compare two implementations of a matrix transpose operation. Despite not looking at a line of code, it becomes quite easy to determine that both memory and instruction throughput of the transpose() kernel outstrip that of the transpose_naive() kernel. (But then again, it would be unfair to expect much more from a function with naive in the name.)
Figure 12.1 The CUDA Visual Profiler being used to profile a matrix transpose application
The CUDA Visual Profiler will execute your application, examining special performance counters built into the GPU. After execution, the profiler can compile data based on these counters and present you with reports based on what it observed. It can verify how long your application spends executing each kernel as well as determine the number of blocks launched, whether your kernel’s memory accesses are coalesced, the number of divergent branches the warps in your code execute, and so on. We encourage you to look into the CUDA Visual Profiler if you have some subtle performance problems in need of resolution.
If you haven’t already grown queasy from all the prose in this book, then it’s possible you might actually be interested in reading more. We know that some of you are more likely to want to play with code in order to continue your learning, but for the rest of you, there are additional written resources to maintain your growth as a CUDA C coder.
If you read Chapter 1, we assured you that this book was most decidedly not a textbook on parallel architectures. Sure, we bandied about terms such as multiprocessor and warp, but this book strives to teach the softer side of programming with CUDA C and its attendant APIs. We learned the CUDA C language within the programming model set forth in the NVIDIA CUDA Programming Guide, largely ignoring the way NVIDIA’s hardware actually accomplishes the tasks we give it.
But to truly become an advanced, well-rounded CUDA C programmer, you will need a more intimate familiarity with the CUDA Architecture and some of the nuances of how NVIDIA GPUs work behind the scenes. To accomplish this, we recommend working your way through Programming Massively Parallel Processors: A Hands-on Approach. To write it, David Kirk, formerly NVIDIA’s chief scientist, collaborated with Wen-mei W. Hwu, the W.J. Sanders III chairman in electrical and computer engineering at the University of Illinois. You’ll encounter a number of familiar terms and concepts, but you will learn about the gritty details of NVIDIA’s CUDA Architecture, including thread scheduling and latency tolerance, memory bandwidth usage and efficiency, specifics on floating-point handling, and much more. The book also addresses parallel programming in a more general sense than this book, so you will gain a better overall understanding of how to engineer parallel solutions to large, complex problems.
Some of us were unlucky enough to have attended university prior to the exciting world of GPU computing. For those who are fortunate enough to be attending college now or in the near future, about 300 universities across the world currently teach courses involving CUDA. But before you start a crash diet to fit back into your college gear, there’s an alternative! On the CUDA Zone website, you will find a link for CUDA U, which is essentially an online university for CUDA education. Or you can navigate directly there with the URL www.nvidia.com/object/cuda_education. Although you will be able to learn quite a bit about GPU computing if you attend some of the online lectures at CUDA U, as of press time there are still no online fraternities for partying after class.
Among the myriad sources of CUDA education, one of the highlights includes an entire course from the University of Illinois on programming in CUDA C. NVIDIA and the University of Illinois provide this content free of charge in the M4V video format for your iPod, iPhone, or compatible video player. We know what you’re thinking: “Finally, a way to learn CUDA while I wait in line at the Department of Motor Vehicles!” You may also be wondering why we waited until the very end of this book to inform you of the existence of what is essentially a movie version of this book. We’re sorry for holding out on you, but the movie is hardly ever as good as the book anyway, right? In addition to actual course materials from the University of Illinois and from the University of California Davis, you will also find materials from CUDA Training Podcasts and links to third-party training and consultancy services.
For more than 30 years, Dr. Dobb’s has covered nearly every major development in computing technology, and NVIDIA’s CUDA is no exception. As part of an ongoing series, Dr. Dobb’s has published an extensive series of articles cutting a broad swath through the CUDA landscape. Entitled CUDA, Supercomputing for the Masses, the series starts with an introduction to GPU computing and progresses quickly from a first kernel to other pieces of the CUDA programming model. The articles in Dr. Dobb’s cover error handling, global memory performance, shared memory, the CUDA Visual Profiler, texture memory, CUDA-GDB, and the CUDPP library of data-parallel CUDA primitives, as well as many other topics. This series of articles is an excellent place to get additional information about some of the material we’ve attempted to convey in this book. Furthermore, you’ll find practical information concerning some of the tools that we’ve only had time to glance over in this text, such as the profiling and debugging options available to you. The series of articles is linked from the CUDA Zone web page but is readily accessible through a web search for Dr Dobbs CUDA.
Even after digging around all of NVIDIA’s documentation, you may find yourself with an unanswered or particularly intriguing question. Perhaps you’re wondering whether anyone else has seen some funky behavior you’re experiencing. Or maybe you’re throwing a CUDA celebration party and wanted to assemble a group of like-minded individuals. For anything you’re interested in asking, we strongly recommend the forums on NVIDIA’s website. Located at http://forums.nvidia.com, the forums are a great place to ask questions of other CUDA users. In fact, after reading this book, you’re in a position to potentially help others if you want! NVIDIA employees regularly prowl the forums, too, so the trickiest questions will prompt authoritative advice right from the source. We also love to get suggestions for new features and feedback on the good, bad, and ugly things that we at NVIDIA do.
Although the NVIDIA GPU Computing SDK is a treasure trove of how-to samples, it’s not designed to be used for much more than pedagogy. If you’re hunting for production-caliber, CUDA-powered libraries or source code, you’ll need to look a bit further. Fortunately, there is a large community of CUDA developers who have produced top-notch solutions. A couple of these tools and libraries are presented here, but you are encouraged to search the Web for whatever solutions you need. And hey, maybe you’ll contribute some of your own to the CUDA C community some day!
NVIDIA, with the help of researchers at the University of California Davis, has released a library known as the CUDA Data Parallel Primitives Library (CUDPP). CUDPP, as the name indicates, is a library of data-parallel algorithm primitives. Some of these primitives include parallel prefix-sum (scan), parallel sort, and parallel reduction. Primitives such as these form the foundation of a wide variety of data-parallel algorithms, including sorting, stream compaction, building data structures, and many others. If you’re looking to write an even moderately complex algorithm, chances are good that either CUDPP already has what you need or it can get you significantly closer to where you want to be. Download it at http://code.google.com/p/cudpp.
As we mentioned in Section 12.2.3: CUBLAS, NVIDIA provides an implementation of the BLAS packaged along with the CUDA Toolkit download. For readers who need a broader solution for linear algebra, take a look at EM Photonics’ CUDA implementation of the industry-standard Linear Algebra Package (LAPACK). Its LAPACK implementation is known as CULAtools and offers more complex linear algebra routines that are built on NVIDIA’s CUBLAS technology. The freely available Basic package offers LU decomposition, QR factorization, linear system solver, and singular value decomposition, as well as least squares and constrained least squares solvers. You can obtain the Basic download at www.culatools.com/versions/basic. You will also notice that EM Photonics offers Premium and Commercial licenses, which contain a far greater fraction of the LAPACK routines, as well as licensing terms that will allow you to distribute your own commercial applications based on CULAtools.
This book has primarily been concerned with C and C++, but clearly hundreds of projects exist that don’t employ these languages. Fortunately, third parties have written wrappers to allow access to CUDA technology from languages not officially supported by NVIDIA. NVIDIA itself provides FORTRAN bindings for its CUBLAS library, but you can also find Java bindings for several of the CUDA libraries at www.jcuda.org. Likewise, Python wrappers to allow the use of CUDA C kernels from Python applications are available from the PyCUDA project at http://mathema.tician.de/software/pycuda. Finally, there are bindings for the Microsoft .NET environment available from the CUDA.NET project at www.hoopoe-cloud.com/Solutions/CUDA.NET.
Although these projects are not officially supported by NVIDIA, they have been around for several versions of CUDA, are all freely available, and each has many successful customers. The moral of this story is, if your language of choice (or your boss’s choice) is not C or C++, you should not rule out GPU computing until you’ve first looked to see whether the necessary bindings are available.
And there you have it. Even after 11 chapters of CUDA C, there are still loads of resources to download, read, watch, and compile. This is a remarkably interesting time to be learning GPU computing, as the era of heterogeneous computing platforms matures. We hope that you have enjoyed learning about one of the most pervasive parallel programming environments in existence. Moreover, we hope that you leave this experience excited about the possibilities to develop new and exciting means for interacting with computers and for processing the ever-increasing amount of information available to your software. It’s your ideas and the amazing technologies you develop that will push GPU computing to the next level.
Chapter 9 covered some of the ways in which we can use atomic operations to enable hundreds of threads to safely make concurrent modifications to shared data. In this appendix, we’ll look at an advanced method for using atomics to implement locking data structures. On its surface, this topic does not seem much more complicated than anything else we’ve examined. And in reality, this is accurate. You’ve learned a lot of complex topics through this book, and locking data structures are no more challenging than these. So, why is this material hiding in the appendix? We don’t want to reveal any spoilers, so if you’re intrigued, read on, and we’ll discuss this through the course of the appendix.
In Chapter 5, we looked at the implementation of a vector dot product using CUDA C. This algorithm was one of a large family of algorithms known as reductions. If you recall, the algorithm computed the dot product of two input vectors by doing the following:
1. Each thread in each block multiplies two corresponding elements of the input vectors and stores the products in shared memory.
2. While a block still has more than one of these products, threads add pairs of products and store the results back to shared memory. Each step ends with half as many values as it started with (this is where the term reduction comes from).
3. When every block has a final sum, each one writes its value to global memory and exits.
4. If the kernel ran with N parallel blocks, the CPU sums these remaining N values to generate the final dot product.
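The four steps above can be sketched as the following CUDA C kernel, a simplified reconstruction in the spirit of Chapter 5 (the fixed threadsPerBlock and the lack of a grid-stride loop are simplifying assumptions):

```cuda
#define threadsPerBlock 256

__global__ void dot( float *a, float *b, float *partial_c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    // Step 1: each thread multiplies one pair of corresponding elements
    cache[cacheIndex] = a[tid] * b[tid];
    __syncthreads();

    // Step 2: each pass halves the number of values (the "reduction")
    for (int i = blockDim.x / 2; i != 0; i /= 2) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }

    // Step 3: one thread per block writes the block's final sum
    if (cacheIndex == 0)
        partial_c[blockIdx.x] = cache[0];
}
// Step 4: the CPU sums the N values in partial_c to finish the dot product
```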
This high-level look at the dot product algorithm is intended to be review, so if it’s been a while or you’ve had a couple glasses of Chardonnay, it may be worth the time to review Chapter 5. If you feel comfortable enough with the dot product code to continue, direct your attention to step 4 in the algorithm. Although it doesn’t involve copying much data to the host or performing many calculations on the CPU, moving the computation back to the CPU to finish is indeed as awkward as it sounds.
But it’s more than an issue of an awkward step to the algorithm or the inelegance of the solution. Consider a scenario where a dot product computation is just one step in a long sequence of operations. If you want to perform every operation on the GPU because your CPU is busy with other tasks or computations, you’re out of luck. As it stands, you’ll be forced to stop computing on the GPU, copy intermediate results back to the host, finish the computation with the CPU, and finally upload that result back to the GPU and resume computing with your next kernel.
Since this is an appendix on atomics and we have gone to such lengths to explain what a pain our original dot product algorithm is, you should see where we’re heading. We intend to fix our dot product using atomics so the entire computation can stay on the GPU, leaving your CPU free to perform other tasks. Ideally, instead of exiting the kernel in step 3 and returning to the CPU in step 4, we want each block to add its final result to a total in global memory. If each value were added atomically, we would not have to worry about potential collisions or indeterminate results. Since we have already used an atomicAdd() operation in the histogram operation, this seems like an obvious choice.
Unfortunately, prior to compute capability 2.0, atomicAdd() operated only on integers. Although this might be fine if you plan to compute dot products of vectors with integer components, it is significantly more common to use floating-point components. However, the majority of NVIDIA hardware does not support atomic arithmetic on floating-point numbers! But there’s a reasonable explanation for this, so don’t throw your GPU in the garbage just yet.
Atomic operations on a value in memory guarantee only that each thread’s read-modify-write sequence will complete without other threads reading or writing the target value while in process. There is no stipulation about the order in which the threads will perform their operations, so in the case of three threads performing addition, sometimes the hardware will perform (A+B)+C and sometimes it will compute A+(B+C). This is acceptable for integers because integer math is associative, so (A+B)+C = A+(B+C). Floating-point arithmetic is not associative because of the rounding of intermediate results, so (A+B)+C often does not equal A+(B+C). As a result, atomic arithmetic on floating-point values is of dubious utility because it gives rise to nondeterministic results in a highly multithreaded environment such as on the GPU. There are many applications where it is simply unacceptable to get two different results from two runs of an application, so the support of floating-point atomic arithmetic was not a priority for earlier hardware.
However, if we are willing to tolerate some nondeterminism in the results, we can still accomplish the reduction entirely on the GPU. But we’ll first need to develop a way to work around the lack of atomic floating-point arithmetic. The solution will still use atomic operations, but not for the arithmetic itself.
The atomicAdd() function we used to build GPU histograms performed a read-modify-write operation without interruption from other threads. At a low level, you can imagine the hardware locking the target memory location while this operation is underway, and while locked, no other threads can read or write the value at the location. If we had a way of emulating this lock in our CUDA C kernels, we could perform arbitrary operations on an associated memory location or data structure. The locking mechanism itself will operate exactly like a typical CPU mutex. If you are unfamiliar with mutual exclusion (mutex), don’t fret. It’s not any more complicated than the things you’ve already learned.
The basic idea is that we allocate a small piece of memory to be used as a mutex. The mutex will act like something of a traffic signal that governs access to some resource. The resource could be a data structure, a buffer, or simply a memory location we want to modify atomically. When a thread reads a 0 from the mutex, it interprets this value as a “green light” indicating that no other thread is using the memory. Therefore, the thread is free to lock the memory and make whatever changes it desires, free of interference from other threads. To lock the memory location in question, the thread writes a 1 to the mutex. This 1 will act as a “red light” for potentially competing threads. The competing threads must then wait until the owner has written a 0 to the mutex before they can attempt to modify the locked memory.
A simple code sequence to accomplish this locking process might look like this:
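A reconstruction of that naive sequence, based on the description above (the mutex pointer is an assumption; as the next paragraph explains, this version is deliberately broken):

```cuda
__device__ void lock( int *mutex ) {
    // BROKEN: another thread can slip in between the read and the write
    while (*mutex != 0)
        ;            // wait for the "green light"
    *mutex = 1;      // claim the lock ("red light" for everyone else)
}
```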
Unfortunately, there’s a problem with this code. Fortunately, it’s a familiar problem: What happens if another thread writes a 1 to the mutex after our thread has read the value to be zero? That is, both threads check the value at mutex and see that it’s zero. They then both write a 1 to this location to signify to other threads that the structure is locked and unavailable for modification. After doing so, both threads think they own the associated memory or data structure and begin making unsafe modifications. Catastrophe ensues!
The operation we want to complete is fairly simple: We need to compare the value at mutex to 0 and store a 1 at that location if and only if the mutex was 0. To accomplish this correctly, this entire operation needs to be performed atomically so we know that no other thread can interfere while our thread examines and updates the value at mutex. In CUDA C, this operation can be performed with the function atomicCAS(), an atomic compare-and-swap. The function atomicCAS() takes a pointer to memory, a value with which to compare the value at that location, and a value to store in that location if the comparison is successful. Using this operation, we can implement a GPU lock function as follows:
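Based on that description, a sketch of the lock function might read:

```cuda
__device__ void lock( int *mutex ) {
    // Atomically: compare *mutex to 0 and, if equal, store 1.
    // atomicCAS() returns the old value, so we spin until we
    // observe a 0 (and have therefore successfully stored our 1).
    while (atomicCAS(mutex, 0, 1) != 0)
        ;
}
```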
The call to atomicCAS() returns the value that it found at the address mutex. As a result, the while() loop will continue to run until atomicCAS() sees a 0 at mutex. When it sees a 0, the comparison is successful, and the thread writes a 1 to mutex. Essentially, the thread will spin in the while() loop until it has successfully locked the data structure. We’ll use this locking mechanism to implement our GPU hash table. But first, we dress the code up in a structure so it will be cleaner to use in the dot product application:
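A sketch of the resulting Lock structure, consistent with the locking mechanism just described (HANDLE_ERROR is the error-checking macro used throughout the book’s examples; the constructor and destructor details are assumptions):

```cuda
struct Lock {
    int *mutex;

    Lock( void ) {
        int state = 0;
        HANDLE_ERROR( cudaMalloc( (void**)&mutex, sizeof(int) ) );
        HANDLE_ERROR( cudaMemcpy( mutex, &state, sizeof(int),
                                  cudaMemcpyHostToDevice ) );
    }

    ~Lock( void ) {
        cudaFree( mutex );
    }

    __device__ void lock( void ) {
        // spin until we transition the mutex from 0 (free) to 1 (held)
        while (atomicCAS(mutex, 0, 1) != 0);
    }

    __device__ void unlock( void ) {
        // restore the mutex to 0 through the same atomic pathway
        atomicExch( mutex, 0 );
    }
};
```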
Notice that we restore the value of mutex with atomicExch( mutex, 0 ). The function atomicExch() reads the value that is located at mutex, exchanges it with the second argument (a 0 in this case), and returns the original value it read. Why would we use an atomic function for this rather than the more obvious method to reset the value at mutex?
*mutex = 0;
If you’re expecting some subtle, hidden reason why this method fails, we hate to disappoint you, but this would work as well. So, why not use this more obvious method? Atomic transactions and generic global memory operations follow different paths through the GPU. Using both atomics and standard global memory operations could therefore lead to an unlock() seeming out of sync with a subsequent attempt to lock() the mutex. The behavior would still be functionally correct, but to ensure consistently intuitive behavior from the application’s perspective, it’s best to use the same pathway for all accesses to the mutex. Because we’re required to use an atomic to lock the resource, we have chosen to also use an atomic to unlock the resource.
The only piece of our earlier dot product example that we endeavor to change is the final CPU-based portion of the reduction. In the previous section, we described how we implement a mutex on the GPU. The Lock structure that implements this mutex is located in lock.h and included at the beginning of our improved dot product example:
With two exceptions, the beginning of our dot product kernel is identical to the kernel we used in Chapter 5. Both exceptions involve the kernel’s signature:
In our updated dot product, we pass a Lock to the kernel in addition to the input vectors and the output buffer. The Lock will govern access to the output buffer during the final accumulation step. The other change is not apparent from the signature itself but involves how one of its arguments is used. Previously, the float *c argument was a buffer for N floats where each of the N blocks could store its partial result. This buffer was copied back to the CPU to compute the final sum. Now, the argument c no longer points to a temporary buffer but to a single floating-point value that will store the dot product of the vectors in a and b. But even with these changes, the kernel starts out exactly as it did in Chapter 5:
At this point in execution, the 256 threads in each block have summed their 256 pairwise products and computed a single value that’s sitting in cache[0]. Each thread block now needs to add its final value to the value at c. To do this safely, we’ll use the lock to govern access to this memory location, so each thread needs to acquire the lock before updating the value *c. After adding the block’s partial sum to the value at c, it unlocks the mutex so other threads can accumulate their values. After adding its value to the final result, the block has nothing remaining to compute and can return from the kernel.
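That accumulation step might be sketched like this, as an excerpt from the end of the kernel (having one designated thread per block take the lock is an assumption consistent with there being one partial sum per block):

```cuda
    // Excerpt: end of the dot product kernel.
    // cache[0] holds this block's partial sum at this point.
    if (cacheIndex == 0) {
        lock.lock();      // spin until this block owns the mutex
        *c += cache[0];   // fold the partial sum into the global total
        lock.unlock();    // release so other blocks can accumulate
    }
```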
The main() routine is very similar to our original implementation, though it does have a couple differences. First, we no longer need to allocate a buffer for partial results as we did in Chapter 5. We now allocate space for only a single floating-point result:
As we did in Chapter 5, we initialize our input arrays and copy them to the GPU. But you’ll notice an additional copy in this example: We’re also copying a zero to dev_c, the location that we intend to use to accumulate our final dot product. Since each block wants to read this value, add its partial sum, and store the result back, we need the initial value to be zero in order to get the correct result.
All that remains is declaring our Lock, invoking the kernel, and copying the result back to the CPU.
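That host-side sequence might be sketched as follows (the kernel name, argument order, and launch configuration are assumptions; HANDLE_ERROR is the book’s error-checking macro):

```cuda
    Lock lock;
    dot<<<blocksPerGrid, threadsPerBlock>>>( lock, dev_a, dev_b, dev_c );

    // copy the single float result back to the CPU
    float c;
    HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(float),
                              cudaMemcpyDeviceToHost ) );
```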
In Chapter 5, this is when we would do a final for() loop to add the partial sums. Since this is done on the GPU using atomic locks, we can skip right to the answer-checking and cleanup code:
Because there is no way to precisely predict the order in which each block will add its partial sum to the final total, it is very likely (almost certain) that the final result will be summed in a different order than the CPU will sum it. Because of the nonassociativity of floating-point addition, it’s therefore quite probable that the final result will be slightly different between the GPU and CPU. There is not much that can be done about this without adding a nontrivial chunk of code to ensure that the blocks acquire the lock in a deterministic order that matches the summation order on the CPU. If you feel extraordinarily motivated, give this a try. Otherwise, we’ll move on to see how these atomic locks can be used to implement a multithreaded data structure.
The hash table is one of the most important and commonly used data structures in computer science, playing an important role in a wide variety of applications. For readers not already familiar with hash tables, we’ll provide a quick primer here. The study of data structures warrants a more in-depth treatment than we intend to provide, but in the interest of making forward progress, we will keep this brief. If you already feel comfortable with the concepts behind hash tables, you should skip to the hash table implementation in Section A.2.2: A CPU Hash Table.
A hash table is essentially a structure that is designed to store pairs of keys and values. For example, you could think of a dictionary as a hash table. Every word in the dictionary is a key, and each word has a definition associated with it. The definition is the value associated with the word, and thus every word and definition in the dictionary form a key/value pair. For this data structure to be useful, though, it is important that we minimize the time it takes to find a particular value if we’re given a key. In general, this should be a constant amount of time. That is, the time to look up a value given a key should be the same, regardless of how many key/value pairs are in the hash table.
At an abstract level, our hash table will place values in “buckets” based on the value’s corresponding key. The method by which we map keys to buckets is often called the hash function. A good hash function will map the set of possible keys uniformly across all the buckets because this will help satisfy our requirement that it take constant time to find any value, regardless of the number of values we’ve added to the hash table.
For example, consider our dictionary hash table. One obvious hash function would involve using 26 buckets, one for each letter of the alphabet. This simple hash function might simply look at the first letter of the key and put the value in one of the 26 buckets based on this letter. Figure A.1 shows how this hash function would assign a few sample words.
Figure A.1 Hashing of words into buckets
Given what we know about the distribution of words in the English language, this hash function leaves much to be desired because it will not map words uniformly across the 26 buckets. Some of the buckets will contain very few key/value pairs, and some of the buckets will contain a large number of pairs. Accordingly, it will take much longer to find the value associated with a word that begins with a common letter such as S than it would take to find the value associated with a word that begins with the letter X. Since we are looking for hash functions that will give us constant-time retrieval of any value, this consequence is fairly undesirable. An immense amount of research has gone into the study of hash functions, but even a brief survey of these techniques is beyond the scope of this book.
The last component of our hash table data structure involves the buckets. If we had a perfect hash function, every key would map to a different bucket. In this case, we can simply store the key/value pairs in an array where each entry in the array is what we’ve been calling a bucket. However, even with an excellent hash function, in most situations we will have to deal with collisions. A collision occurs when more than one key maps to a bucket, such as when we add both the words avocado and aardvark to our dictionary hash table. The simplest way to store all of the values that map to a given bucket is simply to maintain a list of values in the bucket. When we encounter a collision, such as adding aardvark to a dictionary that already contains avocado, we put the value associated with aardvark at the end of the list we’re maintaining in the “A” bucket, as shown in Figure A.2.
After adding the word avocado in Figure A.2, the first bucket has a single key/value pair in its list. Later in this imaginary application we add the word aardvark, a word that collides with avocado because they both start with the letter A. You will notice in Figure A.3 that it simply gets placed at the end of the list in the first bucket:
Figure A.2 Inserting the word avocado into the hash table
Figure A.3 Resolving the conflict when adding the word aardvark
Armed with some background on the notions of a hash function and collision resolution, we’re ready to take a look at implementing our own hash table.
As described in the previous section, our hash table will consist of essentially two parts: a hash function and a data structure of buckets. Our buckets will be implemented exactly as before: We will allocate an array of length N, and each entry in the array holds a list of key/value pairs. Before concerning ourselves with a hash function, we will take a look at the data structures involved:
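A sketch of those data structures (the field names are assumptions consistent with the description that follows):

```c
#include <stddef.h>

struct Entry {
    unsigned int  key;      // the key for this key/value pair
    void         *value;    // arbitrary payload associated with the key
    struct Entry *next;     // next entry in this bucket's list, or NULL
};

struct Table {
    size_t         count;      // number of buckets
    struct Entry **entries;    // array of bucket heads, length count
    struct Entry  *pool;       // preallocated pool of entries
    struct Entry  *firstFree;  // next unused entry in the pool
};
```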
As described in the introductory section, the structure Entry holds both a key and a value. In our application, we will use unsigned integer keys to store our key/value pairs. The value associated with this key can be any data, so we have declared value as a void* to indicate this. Our application will primarily be concerned with creating the hash table data structure, so we won’t actually store anything in the value field. We have included it in the structure for completeness, in case you want to use this code in your own applications. The last piece of data in our hash table Entry is a pointer to the next Entry. After collisions, we’ll have multiple entries in the same bucket, and we have decided to store these entries as a list. So, each entry will point to the next entry in the bucket, thereby forming a list of entries that have hashed to the same location in the table. The last entry will have a NULL next pointer.
At its heart, the Table structure itself is an array of “buckets.” This bucket array is just an array of length count, where each bucket in entries is just a pointer to an Entry. To avoid incurring the complication and performance hit of allocating memory every time we want to add an Entry to the table, the table will maintain a large array of available entries in pool. The field firstFree points to the next available Entry for use, so when we need to add an entry to the table, we can simply use the Entry pointed to by firstFree and increment that pointer. Note that this will also simplify our cleanup code because we can free all of these entries with a single call to free(). If we had allocated every entry as we went, we would have to walk through the table and free every entry one by one.
After understanding the data structures involved, let’s take a look at some of the other support code:
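A sketch of the initialization and cleanup routines, repeating the Entry/Table layout from the earlier sketch for completeness (the function names and parameters are assumptions):

```c
#include <stdlib.h>

struct Entry {
    unsigned int  key;
    void         *value;
    struct Entry *next;
};

struct Table {
    size_t         count;
    struct Entry **entries;
    struct Entry  *pool;
    struct Entry  *firstFree;
};

void initialize_table( struct Table *table, size_t entries, size_t elements ) {
    table->count = entries;
    // one pointer per bucket, zeroed so every list starts out empty
    table->entries = calloc( entries, sizeof(struct Entry*) );
    // preallocate every entry we will ever need in one block
    table->pool = malloc( elements * sizeof(struct Entry) );
    table->firstFree = table->pool;
}

void free_table( struct Table *table ) {
    // two frees cover the whole table: the bucket array and the pool
    free( table->entries );
    free( table->pool );
}
```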
Table initialization consists primarily of allocating and clearing memory for the bucket array entries. We also allocate storage for a pool of entries and initialize the firstFree pointer to be the first entry in the pool array.
At the end of the application, we’ll want to free the memory we’ve allocated, so our cleanup routine frees the bucket array and the pool of free entries:
In our introduction, we spoke quite a bit about the hash function. Specifically, we discussed how a good hash function can make the difference between an excellent hash table implementation and a poor one. In this example, we’re using unsigned integers as our keys, and we need to map these to the indices of our bucket array. The simplest way to do this would be to select the bucket with an index equal to the key. That is, we could store the entry e in table.entries[e.key]. However, we have no way of guaranteeing that every key will be less than the length of the array of buckets. Fortunately, this problem can be solved relatively painlessly:
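As a sketch, the modulo workaround might read:

```c
#include <stddef.h>

// Map an arbitrary unsigned key into [0, count) by taking it
// modulo the number of buckets.
size_t hash( unsigned int key, size_t count ) {
    return key % count;
}
```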
If the hash function is so important, how can we get away with such a simple one? Ideally, we want the keys to map uniformly across all the buckets in our table, and all we’re doing here is taking the key modulo the array length. In reality, hash functions may not normally be this simple, but because this is just an example program, we will be randomly generating our keys. If we assume that the random number generator generates values roughly uniformly, this hash function should map these keys uniformly across all of the buckets of the hash table. In your own hash table implementation, you may require a more complicated hash function.
Having seen the hash table structures and the hash function, we’re ready to look at the process of adding a key/value pair to the table. The process involves three basic steps:
1. Compute the hash function on the input key to determine the new entry’s bucket.
2. Take a preallocated Entry from the pool and initialize its key and value fields.
3. Insert the entry at the front of the proper bucket’s list.
We translate these steps to code in a fairly straightforward way.
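A sketch of those three steps (the structure definitions and hash function are repeated from the earlier sketches for completeness; names are assumptions):

```c
#include <stdlib.h>

struct Entry {
    unsigned int  key;
    void         *value;
    struct Entry *next;
};

struct Table {
    size_t         count;
    struct Entry **entries;
    struct Entry  *pool;
    struct Entry  *firstFree;
};

size_t hash( unsigned int key, size_t count ) {
    return key % count;
}

void add_to_table( struct Table *table, unsigned int key, void *value ) {
    // Step 1: compute the hash function to pick the bucket
    size_t hashValue = hash( key, table->count );

    // Step 2: grab a preallocated entry from the pool and fill it in
    struct Entry *location = table->firstFree++;
    location->key = key;
    location->value = value;

    // Step 3: push the entry onto the front of the bucket's list
    location->next = table->entries[hashValue];
    table->entries[hashValue] = location;
}
```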
If you have never seen linked lists (or it’s been a while), step 3 may be tricky to understand at first. The existing list has its first node stored at table.entries[hashValue]. With this in mind, we can insert a new node at the head of the list in two steps: First, we set our new entry’s next pointer to point to the first node in the existing list. Then, we store the new entry in the bucket array so it becomes the first node of the new list.
Since it’s a good idea to have some idea whether the code you’ve written works, we’ve implemented a routine to perform a sanity check on a hash table. The check involves first walking through the table and examining every node. We compute the hash function on the node’s key and confirm that the node is stored in the correct bucket. After checking every node, we verify that the number of nodes actually in the table is indeed equal to the number of elements we intended to add to the table. If these numbers don’t agree, then either we’ve added a node accidentally to multiple buckets or we haven’t inserted it correctly.
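A sketch of such a sanity-check routine, returning the node count so the caller can compare it against the number of insertions (the structure definitions are repeated from earlier; names are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>

struct Entry {
    unsigned int  key;
    void         *value;
    struct Entry *next;
};

struct Table {
    size_t         count;
    struct Entry **entries;
    struct Entry  *pool;
    struct Entry  *firstFree;
};

size_t hash( unsigned int key, size_t count ) {
    return key % count;
}

// Walk every bucket, confirm each node hashed to the bucket it sits in,
// and count the nodes so the caller can compare against the number added.
size_t verify_table( const struct Table *table ) {
    size_t elementCount = 0;
    for (size_t i = 0; i < table->count; i++) {
        for (struct Entry *e = table->entries[i]; e != NULL; e = e->next) {
            elementCount++;
            if (hash( e->key, table->count ) != i)
                printf( "%u hashed to %zu, but is located in bucket %zu\n",
                        e->key, hash( e->key, table->count ), i );
        }
    }
    return elementCount;
}
```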
With all the infrastructure code out of the way, we can look at main(). As with many of this book’s examples, a lot of the heavy lifting has been done in helper functions, so we hope that main() will be relatively easy to follow:
As you can see, we start by allocating a big chunk of random numbers. These randomly generated unsigned integers will be the keys we insert into our hash table. After generating the numbers, we read the system time in order to measure the performance of our implementation. We initialize the hash table and then insert each random key into the table using a for() loop. After adding all the keys, we read the system time again to compute the elapsed time to initialize and add the keys. Finally, we verify the hash table with our sanity check routine and free the buffers we’ve allocated.
您可能注意到我们为每个键/值对使用 NULL 作为值。在典型的应用程序中,您可能会随键存储一些有用的数据,但由于我们主要关注哈希表实现本身,因此我们为每个键存储一个无意义的值。
You probably noticed that we are using NULL as the value for every key/value pair. In a typical application, you would likely store some useful data with the key, but because we are primarily concerned with the hash table implementation itself, we’re storing a meaningless value with each key.
我们的 CPU 哈希表中内置了一些假设,当我们转移到 GPU 时,这些假设将不再有效。首先,为了让节点的添加更加简单,我们假设一次只能向表中添加一个节点。如果多个线程试图同时向表中添加一个节点,我们最终可能会遇到类似于第 9 章示例中的多线程添加问题的问题。
There are some assumptions built into our CPU hash table that will no longer be valid when we move to the GPU. First, we have assumed that only one node can be added to the table at a time in order to make the addition of a node simpler. If more than one thread were trying to add a node to the table at once, we could end up with problems similar to the multithreaded addition problems in the example from Chapter 9.
例如,让我们回顾一下"鳄梨和土豚"(avocado and aardvark)示例,并假设线程 A 和 B 正在尝试将这些条目添加到表中。线程 A 在 avocado 上计算哈希函数,线程 B 在 aardvark 上计算该函数。它们都判定自己的键属于同一个桶。为了将新条目添加到列表中,线程 A 和 B 首先将新条目的 next 指针设置为现有列表的第一个节点,如图 A.4 所示。
For example, let’s revisit our “avocado and aardvark” example and imagine that threads A and B are trying to add these entries to the table. Thread A computes a hash function on avocado, and thread B computes the function on aardvark. They both decide their keys belong in the same bucket. To add the new entry to the list, thread A and B start by setting their new entry’s next pointer to the first node of the existing list as in Figure A.4.
然后,两个线程都尝试用新条目替换存储桶数组中的条目。但是,只有后完成的那个线程的更新会被保留,因为它覆盖了前一个线程的工作。考虑这样的场景:线程 A 将 altitude 的条目替换为其 avocado 条目。线程 A 刚一完成,线程 B 就将它认为是 altitude 的条目替换为 aardvark 的条目。不幸的是,它替换的是 avocado 而不是 altitude,导致出现图 A.5 所示的情况。
Then, both threads try to replace the entry in the bucket array with their new entry. However, the thread that finishes second is the only thread that has its update preserved because it overwrites the work of the previous thread. So consider the scenario where thread A replaces the entry altitude with its entry for avocado. Immediately after finishing, thread B replaces what it believes to be the entry for altitude with its entry for aardvark. Unfortunately, it’s replacing avocado instead of altitude, resulting in the situation illustrated in Figure A.5.
Figure A.4 Multiple threads attempting to add a node to the same bucket
Figure A.5 The hash table after an unsuccessful concurrent modification by two threads
可悲的是,线程 A 的条目“漂浮”在哈希表之外。幸运的是,我们的健全性检查例程会捕获此问题并提醒我们存在问题,因为它计算的节点数量比我们预期的要少。但我们仍然需要回答这个问题:我们如何在GPU上构建哈希表?!这里的关键观察涉及这样一个事实:一次只有一个线程可以安全地对存储桶进行修改。这类似于我们的点积示例,其中一次只有一个线程可以安全地将其值添加到最终结果中。如果每个存储桶都有一个与之关联的原子锁,我们就可以确保一次只有一个线程对给定的存储桶进行更改。
Thread A’s entry is tragically “floating” outside of the hash table. Fortunately, our sanity check routine would catch this and alert us to the presence of a problem because it would count fewer nodes than we expected. But we still need to answer this question: How do we build a hash table on the GPU?! The key observation here involves the fact that only one thread can safely make modifications to a bucket at a time. This is similar to our dot product example where only one thread at a time could safely add its value to the final result. If each bucket had an atomic lock associated with it, we could ensure that only a single thread was making changes to a given bucket at a time.
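The per-bucket locking idea can be illustrated on the CPU with C11 atomics: one spinlock per bucket, so threads aiming at different buckets never block each other. This is a host-side analogy only; the book's GPU version builds its lock from atomicCAS() and atomicExch() instead:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

#define NUM_BUCKETS 1024

/* one lock per bucket; zero-initialized flags start out clear */
atomic_flag bucket_lock[NUM_BUCKETS];

/* spin until we atomically flip the flag from clear to set */
void lock_bucket(size_t h) {
    while (atomic_flag_test_and_set(&bucket_lock[h]))
        ;   /* another thread owns bucket h: busy-wait */
}

void unlock_bucket(size_t h) {
    atomic_flag_clear(&bucket_lock[h]);
}
```

A thread inserting a key whose hash is `h` would wrap its list splice in `lock_bucket(h)` and `unlock_bucket(h)`, serializing only the threads that actually collide on bucket `h`.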
有了确保对哈希表进行安全多线程访问的方法,我们可以继续实现 A.2.2 节"CPU 哈希表"中编写的哈希表应用程序的 GPU 版本。我们需要包含 lock.h,即 A.1.1 节"原子锁"中 GPU Lock 结构的实现,并且需要将哈希函数声明为 __device__ 函数。除了这些变化之外,基本数据结构和哈希函数与 CPU 实现相同。
Armed with a method to ensure safe multithreaded access to the hash table, we can proceed with a GPU implementation of the hash table application we wrote in Section A.2.2: A CPU Hash Table. We’ll need to include lock.h, the implementation of our GPU Lock structure from Section A.1.1 Atomic Locks, and we’ll need to declare the hash function as a __device__ function. Aside from these changes, the fundamental data structures and hash function are identical to the CPU implementation.
初始化和释放哈希表的步骤与我们在 CPU 上执行的步骤相同,但与前面的示例一样,我们使用 CUDA 运行时函数来完成。我们使用 cudaMalloc() 分配一个桶数组和一个条目池,并使用 cudaMemset() 将桶数组条目设置为零。为了在应用程序完成时释放内存,我们使用 cudaFree()。
Initializing and freeing the hash table consists of the same steps as we performed on the CPU, but as with previous examples, we use CUDA runtime functions to accomplish this. We use cudaMalloc() to allocate a bucket array and a pool of entries, and we use cudaMemset() to set the bucket array entries to zero. To free the memory upon application completion, we use cudaFree().
在 CPU 实现中,我们使用一个例程来检查哈希表的正确性。GPU 版本需要类似的例程,因此我们有两个选择:可以编写 verify_table() 的 GPU 版本,或者使用与 CPU 版本相同的代码,并添加一个将哈希表从 GPU 复制到 CPU 的函数。尽管任一选项都能满足我们的需要,但第二个选项似乎更优,原因有二:首先,它重用了 verify_table() 的 CPU 版本。与一般的代码重用一样,这可以节省时间,并确保将来对代码的更改只需在一个位置进行,即可同时作用于哈希表的两个版本。其次,实现复制功能将揭示一个有趣的问题,该问题的解决方案将来可能对您非常有用。
We used a routine to check our hash table for correctness in the CPU implementation. We need a similar routine for the GPU version, so we have two options. We could write a GPU-based version of verify_table(), or we could use the same code we used in the CPU version and add a function that copies a hash table from the GPU to the CPU. Although either option gets us what we need, the second option seems superior for two reasons: First, it involves reusing our CPU version of verify_table(). As with code reuse in general, this saves time and ensures that future changes to the code would need to be made in only one place for both versions of the hash table. Second, implementing a copy function will uncover an interesting problem, the solution to which may be very useful to you in the future.
正如所承诺的,verify_table() 与 CPU 实现相同,为方便起见在此处重印:
As promised, verify_table() is identical to the CPU implementation and is reprinted here for your convenience:
由于我们选择重用 verify_table() 的 CPU 实现,因此我们需要一个函数将表从 GPU 内存复制到主机内存。这个函数有三个步骤:两个相对明显的步骤,以及第三个比较棘手的步骤。前两个步骤是为哈希表数据分配主机内存,并使用 cudaMemcpy() 将 GPU 数据结构复制到该内存中。我们之前已经这样做过很多次了,所以这应该不足为奇。
Since we chose to reuse our CPU implementation of verify_table(), we need a function to copy the table from GPU memory to host memory. There are three steps to this function, two relatively obvious steps and a third, trickier step. The first two steps involve allocating host memory for the hash table data and performing a copy of the GPU data structures into this memory with cudaMemcpy(). We have done this many times previously, so this should come as no surprise.
该例程的棘手之处在于,我们复制的一些数据是指针。我们不能简单地将这些指针复制到主机,因为它们是 GPU 上的地址,在主机上将不再是有效指针。但是,指针的相对偏移量仍然有效。每个指向 Entry 的 GPU 指针都指向 table.pool[] 数组中的某个位置,但为了使哈希表在主机上可用,我们需要它们指向 hostTable.pool[] 数组中对应的 Entry。
The tricky portion of this routine involves the fact that some of the data we have copied are pointers. We cannot simply copy these pointers to the host because they are addresses on the GPU; they will no longer be valid pointers on the host. However, the relative offsets of the pointers will still be valid. Every GPU pointer to an Entry points somewhere within the table.pool[] array, but for the hash table to be usable on the host, we need them to point to the same Entry in the hostTable.pool[] array.
给定一个 GPU 指针 X,我们需要将该指针相对 table.pool 的偏移量加到 hostTable.pool 上,以获得有效的主机指针。也就是说,新指针应按如下方式计算:
Given a GPU pointer X, we therefore need to add the pointer’s offset from table.pool to hostTable.pool to get a valid host pointer. That is, the new pointer should be computed as follows:
(X - table.pool) + hostTable.pool
我们对从 GPU 复制的每个 Entry 指针执行此更新:hostTable.entries 中的各个 Entry 指针,以及表的条目池中每个 Entry 的 next 指针:
We perform this update for every Entry pointer we’ve copied from the GPU: the Entry pointers in hostTable.entries and the next pointer of every Entry in the table’s pool of entries:
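The rebasing trick can be demonstrated entirely on the host: treat one array as the "device" pool, copy it raw, and then patch every internal pointer with the (X - old_pool) + new_pool formula. `Entry` here is a hypothetical two-field node, not the book's exact structure:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Entry {
    unsigned int  key;
    struct Entry *next;
} Entry;

/* (X - old_pool) + new_pool, with NULL passed through unchanged */
static Entry *rebase(Entry *x, const Entry *old_pool, Entry *new_pool) {
    return x ? new_pool + (x - old_pool) : NULL;
}

/* Copy a pool of n entries and fix up each next pointer so it refers
   to the corresponding slot of the destination pool. */
void copy_pool(Entry *dst, const Entry *src, size_t n) {
    memcpy(dst, src, n * sizeof(Entry));   /* next still points into src */
    for (size_t i = 0; i < n; i++)
        dst[i].next = rebase(dst[i].next, src, dst);
}
```

The same arithmetic works for the bucket-array pointers: only the base changes, never the offset.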
看过数据结构、哈希函数、初始化、清理和验证代码后,剩下的最重要的部分就是真正涉及 CUDA C 原子操作的部分。add_to_table() 内核的参数是要添加到哈希表中的键数组和值数组,接下来的参数是哈希表本身,最后一个参数是用于锁定表中各个存储桶的锁数组。由于我们的输入是线程需要索引的两个数组,因此我们还需要非常常见的索引线性化:
Having seen the data structures, hash function, initialization, cleanup, and verification code, the most important piece remaining is the one that actually involves CUDA C atomics. As arguments, the add_to_table() kernel will take an array of keys and values to be added to the hash table. Its next argument is the hash table itself, and the final argument is an array of locks that will be used to lock each of the table’s buckets. Since our input is two arrays that our threads will need to index, we also need our all-too-common index linearization:
我们的线程像在点积示例中一样遍历输入数组。对于 keys[] 数组中的每个键,线程将计算哈希函数,以确定该键/值对属于哪个桶。确定目标桶后,线程锁定该桶,添加其键/值对,然后解锁该桶。
Our threads walk through the input arrays exactly like they did in the dot product example. For each key in the keys[] array, the thread will compute the hash function in order to determine which bucket the key/value pair belongs in. After determining the target bucket, the thread locks the bucket, adds its key/value pair, and unlocks the bucket.
然而,这段代码有一些非常奇特的地方。for() 循环和后续的 if() 语句似乎完全没有必要。在第 6 章中,我们介绍了 warp 的概念。如果您忘记了,warp 是 32 个以锁步方式一起执行的线程的集合。虽然其在 GPU 中实现的细节超出了本书的范围,但一个 warp 中一次只能有一个线程获取锁,如果我们让 warp 中的所有 32 个线程同时竞争锁,我们将会遇到很多麻烦。在这种情况下,我们发现最好在软件中完成一些工作:简单地遍历 warp 中的每个线程,让每个线程依次有机会获取数据结构的锁、完成其工作,然后释放锁。
There is something remarkably peculiar about this bit of code, however. The for() loop and subsequent if() statement seem decidedly unnecessary. In Chapter 6, we introduced the concept of a warp. If you’ve forgotten, a warp is a collection of 32 threads that execute together in lockstep. Although the nuances of how this gets implemented in the GPU are beyond the scope of this book, only one thread in the warp can acquire the lock at a time, and we will suffer many a headache if we let all 32 threads in the warp contend for the lock simultaneously. In this situation, we’ve found that it’s best to do some of the work in software and simply walk through each thread in the warp, giving each a chance to acquire the data structure’s lock, do its work, and subsequently release the lock.
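The shape of that software serialization is sketched below in CUDA C. The Lock type, the table layout, ELEMENTS, and the surrounding kernel arguments are assumptions reconstructed from the description above, not the book's verbatim listing:

```cuda
__global__ void add_to_table(unsigned int *keys, void **values,
                             Table table, Lock *lock) {
    int tid    = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    while (tid < ELEMENTS) {
        unsigned int key = keys[tid];
        size_t hashValue = hash(key, table.count);

        /* Let one lane of the warp at a time take its turn with the
           lock; letting all 32 contend simultaneously invites trouble. */
        for (int i = 0; i < 32; i++) {
            if ((tid % 32) == i) {
                Entry *location = &(table.pool[tid]);
                location->key   = key;
                location->value = values[tid];
                lock[hashValue].lock();
                location->next = table.entries[hashValue];
                table.entries[hashValue] = location;
                lock[hashValue].unlock();
            }
        }
        tid += stride;
    }
}
```

Only the lane whose `tid % 32` matches `i` attempts the lock on each iteration, so within a warp the acquisitions happen one at a time; threads in different warps still contend normally through the per-bucket locks.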
main() 的流程应该与 CPU 实现相同。我们首先为哈希表键分配一大块随机数据。然后我们创建 start 和 stop CUDA 事件,并记录 start 事件用于性能测量。我们接着为随机键数组分配 GPU 内存,将数组复制到设备,并初始化哈希表:
The flow of main() should appear identical to the CPU implementation. We start by allocating a large chunk of random data for our hash table keys. Then we create start and stop CUDA events and record the start event for our performance measurements. We proceed to allocate GPU memory for our array of random keys, copy the array up to the device, and initialize our hash table:
构建哈希表前的最后一步准备工作,是为哈希表的存储桶准备锁。我们为哈希表中的每个桶分配一把锁。可以想象,我们可以通过对整个表只使用一把锁来节省大量内存。但这样做会彻底破坏性能,因为每当一组线程试图同时向表中添加条目时,每个线程都必须竞争这把表锁。因此,我们声明一个锁数组,表中的每个桶对应一把锁。然后我们为这些锁分配一个 GPU 数组,并将它们复制到设备:
The last step of preparation to build our hash table involves preparing locks for the hash table’s buckets. We allocate one lock for each bucket in the hash table. Conceivably we could save a lot of memory by using only one lock for the whole table. But doing so would utterly destroy performance because every thread would have to compete for the table lock whenever a group of threads tries to simultaneously add entries to the table. So we declare an array of locks, one for every bucket in the array. We then allocate a GPU array for the locks and copy them up to the device:
main() 的其余部分与 CPU 版本类似:我们将所有键添加到哈希表中,停止性能计时器,验证哈希表的正确性,然后进行清理:
The rest of main() is similar to the CPU version: We add all of our keys to the hash table, stop the performance timer, verify the correctness of the hash table, and clean up after ourselves:
使用 Intel Core 2 Duo,A.2.2 节"CPU 哈希表"中的 CPU 哈希表示例需要 360 毫秒才能从 100MB 数据构建哈希表。该代码使用 -O3 选项构建,以确保 CPU 代码得到最大程度的优化。A.2.4 节"GPU 哈希表"中的多线程 GPU 哈希表需要 375 毫秒才能完成相同的任务。两者相差不到 5%,执行时间大致相当,这就提出了一个很好的问题:为什么像 GPU 这样的大规模并行机器会被同一应用程序的单线程 CPU 版本击败?坦率地说,这是因为 GPU 的设计并不擅长对哈希表等复杂数据结构进行多线程访问。因此,很少有性能上的理由在 GPU 上构建哈希表这样的数据结构。如果您的应用程序需要做的只是构建哈希表或类似的数据结构,那么您最好在 CPU 上完成。
Using an Intel Core 2 Duo, the CPU hash table example in Section A.2.2: A CPU Hash Table takes 360ms to build a hash table from 100MB of data. The code was built with the option -O3 to ensure maximally optimized CPU code. The multithreaded GPU hash table in Section A.2.4: A GPU Hash Table takes 375ms to complete the same task. Differing by less than 5 percent, these are roughly comparable execution times, which raises an excellent question: Why would such a massively parallel machine such as a GPU get beaten by a single-threaded CPU version of the same application? Frankly, this is because GPUs were not designed to excel at multithreaded access to complex data structures such as a hash table. For this reason, there are very few performance motivations to build a data structure such as a hash table on the GPU. So if all your application needs to do is build a hash table or similar data structure, you would likely be better off doing this on your CPU.
另一方面,您有时会发现自己处于这样一种情况:较长的计算管道涉及一个或两个阶段,而 GPU 与同类 CPU 实现相比并不具有性能优势。在这些情况下,您有三个(有些明显)选择:
On the other hand, you will sometimes find yourself in a situation where a long computation pipeline involves one or two stages that the GPU does not enjoy a performance advantage over comparable CPU implementations. In these situations, you have three (somewhat obvious) options:
• 在GPU上执行管道的每一步
• Perform every step of the pipeline on the GPU
• 在CPU 上执行管道的每一步
• Perform every step of the pipeline on the CPU
• 在 GPU 上执行一些管道步骤,在 CPU 上执行一些管道步骤
• Perform some pipeline steps on the GPU and some on the CPU
最后一个选项听起来像是两全其美。但是,这意味着您需要在应用程序中任何想把计算从 GPU 移到 CPU(或反之)的位置同步 CPU 和 GPU。主机和 GPU 之间的这种同步以及随后的数据传输,通常会抵消您最初采用混合方法可能获得的任何性能优势。
The last option sounds like the best of both worlds; however, it implies that you will need to synchronize your CPU and GPU at any point in your application where you want to move computation from the GPU to CPU or back. This synchronization and subsequent data transfer between host and GPU can often kill any performance advantage you might have derived from employing a hybrid approach in the first place.
在这种情况下,即使 GPU 并不非常适合算法的某些步骤,在 GPU 上执行计算的每个阶段也可能是值得的。这样,GPU 哈希表可以避免 CPU/GPU 同步点,最大限度地减少主机和 GPU 之间的数据传输,并释放 CPU 来执行其他计算。在这种情况下,即使 GPU 在某些步骤上并不比 CPU 快(甚至在某些情况下可能被 CPU 击败),GPU 实现的整体性能也可能超过 CPU/GPU 混合方法。
In such a situation, it may be worth your time to perform every phase of computation on the GPU, even if the GPU is not ideally suited for some steps of the algorithm. In this vein, the GPU hash table can potentially prevent a CPU/GPU synchronization point, minimize data transfer between the host and GPU and free the CPU to perform other computations. In such a scenario, it’s possible that the overall performance of a GPU implementation would exceed a CPU/GPU hybrid approach, despite the GPU being no faster than the CPU on certain steps (or potentially even getting trounced by the CPU in some cases).
我们了解了如何使用原子比较和交换操作来实现 GPU 互斥体。使用使用此互斥体构建的锁,我们了解了如何改进原始点积应用程序以完全在 GPU 上运行。我们进一步实现了这个想法,实现了一个多线程哈希表,该哈希表使用锁数组来防止多个线程同时进行不安全的修改。事实上,我们开发的互斥体可以用于任何形式的并行数据结构,我们希望您会发现它在您自己的实验和应用程序开发中很有用。当然,使用 GPU 实现基于互斥体的数据结构的应用程序的性能需要仔细研究。我们的 GPU 哈希表会被相同代码的单线程 CPU 版本击败,因此仅在某些情况下才将 GPU 用于此类应用程序才有意义。没有一揽子规则可用于确定仅 GPU、仅 CPU 或混合方法是否效果最佳,但了解如何使用原子将允许您根据具体情况做出决定。
We saw how to use atomic compare-and-swap operations to implement a GPU mutex. Using a lock built with this mutex, we saw how to improve our original dot product application to run entirely on the GPU. We carried this idea further by implementing a multithreaded hash table that used an array of locks to prevent unsafe simultaneous modifications by multiple threads. In fact, the mutex we developed could be used for any manner of parallel data structures, and we hope that you’ll find it useful in your own experimentation and application development. Of course, the performance of applications that use the GPU to implement mutex-based data structures needs careful study. Our GPU hash table gets beaten by a single-threaded CPU version of the same code, so it will make sense to use the GPU for this type of application only in certain situations. There is no blanket rule that can be used to determine whether a GPU-only, CPU-only, or hybrid approach will work best, but knowing how to use atomics will allow you to make that decision on a case-by-case basis.
add()函数,CPU 向量和,40–44
add() function, CPU vector sums, 40–44
add_to_table()内核,GPU 哈希表,272
add_to_table() kernel, GPU hash table, 272
ALU(算术逻辑单元)
ALUs (arithmetic logic units)
CUDA 架构,7
CUDA Architecture, 7
使用常量内存,96
using constant memory, 96
anim_and_exit()方法,GPU 波纹,70
anim_and_exit() method, GPU ripples, 70
anim_gpu() routine, texture memory, 123, 129
动画
animation
GPU Julia 集示例,50–57
GPU Julia Set example, 50–57
使用线程的 GPU 波纹,69–74
GPU ripple using threads, 69–74
传热模拟,121–125
heat transfer simulation, 121–125
animExit(), 149
animExit(), 149
异步调用
asynchronous call
cudaMemcpyAsync()作为,197
cudaMemcpyAsync() as, 197
使用事件,109
using events with, 109
原子锁
atomic locks
GPU 哈希表,274–275
GPU hash table, 274–275
概述,251–254
overview of, 251–254
atomicAdd()
atomicAdd()
原子锁,251–254
atomic locks, 251–254
使用全局内存的直方图内核,180
histogram kernel using global memory, 180
不支持浮点数,251
not supporting floating-point numbers, 251
atomicCAS(),GPU 锁,252–253
atomicCAS(), GPU lock, 252–253
atomicExch(),GPU 锁,253–254
atomicExch(), GPU lock, 253–254
原子,163–184
atomics, 163–184
高级,249–277
advanced, 249–277
NVIDIA GPU 的计算能力,164–167
compute capability of NVIDIA GPUs, 164–167
点积和,248–251
dot product and, 248–251
哈希表。参见哈希表
hash tables. see hash tables
直方图计算,CPU,171–173
histogram computation, CPU, 171–173
直方图计算,GPU,173–179
histogram computation, GPU, 173–179
直方图计算,概述,170
histogram computation, overview, 170
使用全局内存原子的直方图内核,179–181
histogram kernel using global memory atomics, 179–181
使用共享/全局内存原子的直方图内核,181–183
histogram kernel using shared/global memory atomics, 181–183
对于最低计算能力,167–168
for minimum compute capability, 167–168
锁,251–254
locks, 251–254
操作,168–170
operations, 168–170
带宽,常量内存节省,106–107
bandwidth, constant memory saving, 106–107
基本线性代数子程序 (BLAS),CUBLAS 库,239–240
Basic Linear Algebra Subprograms (BLAS), CUBLAS library, 239–240
bin 计数,CPU 直方图计算,171–173
bin counts, CPU histogram computation, 171–173
BLAS(基本线性代数子程序),CUBLAS 库,239–240
BLAS (Basic Linear Algebra Subprograms), CUBLAS library, 239–240
2D 纹理内存,131–133
2D texture memory, 131–133
纹理内存,127–129
texture memory, 127–129
blockDim 变量
blockDim variable
2D 纹理内存,132–133
2D texture memory, 132–133
dot product computation, 76–78, 85
点积计算,不正确的优化,88
dot product computation, incorrect optimization, 88
使用原子锁进行点积计算,255–256
dot product computation with atomic locks, 255–256
点积计算,零拷贝内存,221–222
dot product computation, zero-copy memory, 221–222
GPU 哈希表实现,272
GPU hash table implementation, 272
使用线程的 GPU 波纹,72–73
GPU ripple using threads, 72–73
较长向量的 GPU 总和,63–65
GPU sums of a longer vector, 63–65
任意长向量的 GPU 总和,66–67
GPU sums of arbitrarily long vectors, 66–67
图形互操作性,145
graphics interoperability, 145
使用全局内存原子的直方图内核,179–180
histogram kernel using global memory atomics, 179–180
使用共享/全局内存原子的直方图内核,182–183
histogram kernel using shared/global memory atomics, 182–183
多个 CUDA 流,200
multiple CUDA streams, 200
GPU 上的光线追踪,102
ray tracing on GPU, 102
共享内存位图,91
shared memory bitmap, 91
温度更新计算,119–120
temperature update computation, 119–120
blockIdx 变量
blockIdx variable
2D 纹理内存,132–133
2D texture memory, 132–133
定义, 57
defined, 57
dot product computation, 76–77, 85
使用原子锁进行点积计算,255–256
dot product computation with atomic locks, 255–256
点积计算,零拷贝内存,221–222
dot product computation, zero-copy memory, 221–222
GPU 哈希表实现,272
GPU hash table implementation, 272
GPU Julia 集,53
GPU Julia Set, 53
使用线程的 GPU 波纹,72–73
GPU ripple using threads, 72–73
较长向量的 GPU 总和,63–64
GPU sums of a longer vector, 63–64
GPU 向量和,44–45
GPU vector sums, 44–45
图形互操作性,145
graphics interoperability, 145
使用全局内存原子的直方图内核,179–180
histogram kernel using global memory atomics, 179–180
使用共享/全局内存原子的直方图内核,182–183
histogram kernel using shared/global memory atomics, 182–183
多个 CUDA 流,200
multiple CUDA streams, 200
GPU 上的光线追踪,102
ray tracing on GPU, 102
共享内存位图,91
shared memory bitmap, 91
温度更新计算,119–121
temperature update computation, 119–121
块
blocks
定义, 57
defined, 57
GPU Julia 集,51
GPU Julia Set, 51
GPU 向量和,44–45
GPU vector sums, 44–45
硬件施加的限制,46
hardware-imposed limits on, 46
分裂成线程。查看并行块,分成线程
splitting into threads. see parallel blocks, splitting into threads
乳腺癌,CUDA 应用程序,8–9
breast cancer, CUDA applications for, 8–9
桥接器,连接多个 GPU,224
bridges, connecting multiple GPUs, 224
桶、哈希表
buckets, hash table
概念,259–260
concept of, 259–260
GPU 哈希表实现,269–275
GPU hash table implementation, 269–275
多线程哈希表和,267–268
multithreaded hash tables and, 267–268
bufferObj 变量
bufferObj variable
创建 GPUAnimBitmap,149
creating GPUAnimBitmap, 149
向 CUDA 运行时注册,143
registering with CUDA runtime, 143
使用 cudaGraphicsGLRegisterBuffer() 注册,151
registering with cudaGraphicsGLRegisterBuffer(), 151
setting up graphics interoperability, 141, 143–144
缓冲区,声明共享内存,76–77
buffers, declaring shared memory, 76–77
cache[]共享内存变量
cache[] shared memory variable
声明名为 cache[] 的共享内存缓冲区,76–77
declaring buffer of shared memory named, 76–77
dot product computation, 79–80, 85–86
使用原子锁进行点积计算,255–256
dot product computation with atomic locks, 255–256
cacheIndex,不正确的点积优化,88
cacheIndex, incorrect dot product optimization, 88
缓存、纹理、116–117
caches, texture, 116–117
回调,GPUAnimBitmap用户注册,149
callbacks, GPUAnimBitmap user registration for, 149
剑桥大学,CUDA 应用程序,9–10
Cambridge University, CUDA applications, 9–10
相机
camera
光线追踪概念,97–98
ray tracing concepts, 97–98
GPU 上的光线追踪,99–104
ray tracing on GPU, 99–104
蜂窝电话,并行处理,2
cellular phones, parallel processing in, 2
中央处理单元。请参阅CPU(中央处理单元)
central processing units. see CPUs (central processing units)
清洁剂,CUDA 应用程序,10–11
cleaning agents, CUDA applications for, 10–11
clickDrag(), 149
clickDrag(), 149
时钟速度,演变,2–3
clock speed, evolution of, 2–3
代码,打破假设,45–46
code, breaking assumptions, 45–46
代码资源,CUDA,246–248
code resources, CUDa, 246–248
冲突解决,哈希表,260–261
collision resolution, hash tables, 260–261
颜色
color
CPU Julia 集,48–49
CPU Julia Set, 48–49
GPU 计算的早期阶段,5-6
early days of GPU computing, 5–6
光线追踪概念,98
ray tracing concepts, 98
编译器
compiler
对于最低计算能力,167–168
for minimum compute capability, 167–168
标准 C,用于 GPU 代码,18–19
standard C, for GPU code, 18–19
复数
complex numbers
定义要存储的通用类,49–50
defining generic class to store, 49–50
用单精度浮点组件存储,54
storing with single-precision floating-point components, 54
计算流体动力学,CUDA 应用程序,9–10
computational fluid dynamics, CUDA applications for, 9–10
计算能力
compute capability
编译最低值,167–168
compiling for minimum, 167–168
cudaChooseDevice() 与,141
cudaChooseDevice() and, 141
定义, 164
defined, 164
NVIDIA GPU 数量,164–167
of NVIDIA GPUs, 164–167
概述,141–142
overview of, 141–142
电脑游戏,3D 图形开发,4–5
computer games, 3D graphic development for, 4–5
常量内存
constant memory
加速应用程序,95
accelerating applications with, 95
通过事件衡量绩效,108–110
measuring performance with events, 108–110
测量射线追踪器性能,110–114
measuring ray tracer performance, 110–114
概述, 96
overview of, 96
性能,106–107
performance with, 106–107
光线追踪简介,96–98
ray tracing introduction, 96–98
GPU 上的光线追踪,98–104
ray tracing on GPU, 98–104
光线追踪,104–106
ray tracing with, 104–106
总结回顾,114
summary review, 114
__constant__ 函数
__constant__ function
将内存声明为,104–106
declaring memory as, 104–106
恒定内存性能,106–107
performance with constant memory, 106–107
copy_const_kernel()核心
copy_const_kernel() kernel
2D 纹理内存,133
2D texture memory, 133
使用纹理内存,129–130
using texture memory, 129–130
copy_constant_kernel(),计算温度更新,119–121
copy_constant_kernel(), computing temperature updates, 119–121
CPUAnimBitmap类,创建 GPU 波纹,69–70 , 147–148
CPUAnimBitmap class, creating GPU ripple, 69–70, 147–148
CPU(中央处理单元)
CPUs (central processing units)
时钟速度的演变,2–3
evolution of clock speed, 2–3
核心数量的演变,3
evolution of core count, 3
释放内存。参见 free(),C 语言
freeing memory. see free(), C language
哈希表,261–267
hash tables, 261–267
直方图计算,171–173
histogram computation on, 171–173
作为本书中的主机,23
as host in this book, 23
线程管理和调度,72
thread management and scheduling in, 72
向量和,39–41
vector sums, 39–41
使用反向 CPU 直方图验证 GPU 直方图,175–176
verifying GPU histogram using reverse CPU histogram, 175–176
CUBLAS 图书馆,239–240
CUBLAS library, 239–240
cuComplex结构,CPU Julia 集,48–49
cuComplex structure, CPU Julia Set, 48–49
cuComplex结构,GPU Julia 集,53–55
cuComplex structure, GPU Julia Set, 53–55
CUDA,大众超级计算,245–246
CUDA, Supercomputing for the Masses, 245–246
CUDA架构
CUDA Architecture
计算流体动力学应用,9–10
computational fluid dynamic applications, 9–10
定义, 7
defined, 7
环境科学应用,10–11
environmental science applications, 10–11
首次应用,7
first application of, 7
医学成像应用,8–9
medical imaging applications, 8–9
理解资源,244–245
resource for understanding, 244–245
使用,7–8
using, 7–8
CUDA C
CUDA C
计算流体动力学应用,9–10
computational fluid dynamic applications, 9–10
CUDA 开发工具包,16–18
CUDA development toolkit, 16–18
支持 CUDA 的图形处理器,14–16
CUDA-enabled graphics processor, 14–16
调试,241–242
debugging, 241–242
开发环境设置。查看开发环境设置
development environment setup. see development environment setup
的发展, 7
development of, 7
环境科学应用,10–11
environmental science applications, 10–11
入门,13–20
getting started, 13–20
医学成像应用,8–9
medical imaging applications, 8–9
NVIDIA 设备驱动程序,16
NVIDIA device driver, 16
在多个 GPU 上。请参阅GPU(图形处理单元)、多系统
on multiple GPUs. see GPUs (graphics processing units), multi-system
概述,21-22
overview of, 21–22
并行编程。参见并行编程、CUDA
parallel programming in. see parallel programming, CUDA
传递参数,24–27
passing parameters, 24–27
查询设备,27–33
querying devices, 27–33
标准 C 编译器,18–19
standard C compiler, 18–19
使用设备属性,33–35
using device properties, 33–35
编写第一个程序,22–24
writing first program, 22–24
CUDA 数据并行基元库 (CUDPP),246
CUDA Data Parallel Primitives Library (CUDPP), 246
CUDA 事件 API 和性能,108–110
CUDA event API, and performance, 108–110
CUDA 内存检查器,242
CUDA Memory Checker, 242
CUDA 流
CUDA streams
GPU 工作调度,205–208
GPU work scheduling with, 205–208
概述, 192
overview of, 192
单个,192–198
single, 192–198
总结回顾,211
summary review, 211
CUDA 工具包,238–240
CUDA Toolkit, 238–240
在开发环境中,16–18
in development environment, 16–18
CUDA工具
CUDA tools
CUBLAS 图书馆,239–240
CUBLAS library, 239–240
CUDA 工具包,238–239
CUDA Toolkit, 238–239
CUFFT 图书馆,239
CUFFT library, 239
调试 CUDA C,241–242
debugging CUDA C, 241–242
GPU 计算 SDK 下载,240–241
GPU Computing SDK download, 240–241
NVIDIA 性能基元,241
NVIDIA Performance Primitives, 241
概述, 238
overview of, 238
视觉分析器,243–244
Visual Profiler, 243–244
CUDA Zone,167
CUDA Zone, 167
cuda_malloc_test(),页锁定内存,189
cuda_malloc_test(), page-locked memory, 189
cudaBindTexture(),纹理内存,126–127
cudaBindTexture(), texture memory, 126–127
cudaBindTexture2D(),纹理内存,134
cudaBindTexture2D(), texture memory, 134
cudaChannelFormatDesc(),绑定 2D 纹理,134
cudaChannelFormatDesc(), binding 2D textures, 134
cudaChooseDevice()
cudaChooseDevice()
定义, 34
defined, 34
GPUAnimBitmap初始化,150
GPUAnimBitmap initialization, 150
对于有效身份证件,141–142
for valid ID, 141–142
cudaD3D9SetDirect3DDevice(),DirectX 互操作性,160–161
cudaD3D9SetDirect3DDevice(), DirectX interoperability, 160–161
cudaDeviceMapHost(),零拷贝内存点积,221
cudaDeviceMapHost(), zero-copy memory dot product, 221
cudaDeviceProp结构
cudaDeviceProp structure
cudaChooseDevice() 与之配合,141
cudaChooseDevice() working with, 141
多个 CUDA 流,200
multiple CUDA streams, 200
概述,28-31
overview of, 28–31
单个 CUDA 流,193–194
single CUDA streams, 193–194
使用设备属性,34
using device properties, 34
支持 CUDA 的图形处理器,14–16
CUDA-enabled graphics processors, 14–16
cudaEventCreate()
cudaEventCreate()
2D 纹理内存,134
2D texture memory, 134
CUDA 流,192,194,201
GPU 哈希表实现,274–275
GPU hash table implementation, 274–275
GPU histogram computation, 173, 177
measuring performance with events, 108–110, 112
页锁定主机内存应用,188–189
page-locked host memory application, 188–189
使用 GPUAnimBitmap 执行动画,158
performing animation with GPUAnimBitmap, 158
GPU 上的光线追踪,100
ray tracing on GPU, 100
标准主机内存点积,215
standard host memory dot product, 215
纹理内存,124
texture memory, 124
zero-copy host memory, 215, 217
cudaEventDestroy()
cudaEventDestroy()
定义, 112
defined, 112
GPU 哈希表实现,275
GPU hash table implementation, 275
GPU histogram computation, 176, 178
heat transfer simulation, 123, 131, 137
通过事件衡量绩效,111–113
measuring performance with events, 111–113
页锁定主机内存,189–190
page-locked host memory, 189–190
纹理内存,136
texture memory, 136
zero-copy host memory, 217, 220
cudaEventElapsedTime()
cudaEventElapsedTime()
2D纹理内存,130
2D texture memory, 130
定义, 112
defined, 112
GPU 哈希表实现,275
GPU hash table implementation, 275
GPU histogram computation, 175, 178
传热模拟动画,122
heat transfer simulation animation, 122
使用图形互操作性进行热传递,157
heat transfer using graphics interoperability, 157
page-locked host memory, 188, 190
标准主机内存点积,216
standard host memory dot product, 216
零拷贝内存点积,219
zero-copy memory dot product, 219
cudaEventRecord()
cudaEventRecord()
CUDA 流,194,198,201
CUDA 流和,192
CUDA streams and, 192
GPU 哈希表实现,274–275
GPU hash table implementation, 274–275
GPU histogram computation, 173, 175, 177
传热模拟动画,122
heat transfer simulation animation, 122
使用图形互操作性进行热传递,156–157
heat transfer using graphics interoperability, 156–157
通过事件衡量绩效,108–109
measuring performance with events, 108–109
测量射线追踪器性能,110–113
measuring ray tracer performance, 110–113
页锁定主机内存,188–190
page-locked host memory, 188–190
GPU 上的光线追踪,100
ray tracing on GPU, 100
标准主机内存点积,216
standard host memory dot product, 216
使用纹理内存,129–130
using texture memory, 129–130
cudaEventSynchronize()
cudaEventSynchronize()
2D纹理内存,130
2D texture memory, 130
GPU 哈希表实现,275
GPU hash table implementation, 275
GPU histogram computation, 175, 178
传热模拟动画,122
heat transfer simulation animation, 122
使用图形互操作性进行热传递,157
heat transfer using graphics interoperability, 157
用事件衡量绩效,109、111、113
measuring performance with events, 109, 111, 113
page-locked host memory, 188, 190
标准主机内存点积,216
standard host memory dot product, 216
cudaFree()
cudaFree()
分配便携式固定内存,235
allocating portable pinned memory, 235
CPU 向量和,42
CPU vector sums, 42
定义,26–27
defined, 26–27
dot product computation, 84, 87
使用原子锁进行点积计算,258
dot product computation with atomic locks, 258
GPU 哈希表实现,269–270,275
GPU hash table implementation, 269–270, 275
使用线程的 GPU 波纹,69
GPU ripple using threads, 69
任意长向量的 GPU 总和,69
GPU sums of arbitrarily long vectors, 69
多个 CPU,229
multiple CPUs, 229
页锁定主机内存,189–190
page-locked host memory, 189–190
GPU 上的光线追踪,101
ray tracing on GPU, 101
具有恒定内存的光线追踪,105
ray tracing with constant memory, 105
共享内存位图,91
shared memory bitmap, 91
标准主机内存点积,217
standard host memory dot product, 217
cudaFreeHost()
cudaFreeHost()
分配便携式固定内存,233
allocating portable pinned memory, 233
定义,190
defined, 190
释放使用 cudaHostAlloc() 分配的缓冲区,190
freeing buffer allocated with cudaHostAlloc(), 190
零拷贝内存点积,220
zero-copy memory dot product, 220
CUDA-GDB 调试工具,241–242
CUDA-GDB debugging tool, 241–242
cudaGetDevice()
cudaGetDevice()
设备属性,34
device properties, 34
零拷贝内存点积,220
zero-copy memory dot product, 220
cudaGetDeviceCount()
cudaGetDeviceCount()
设备属性,34
device properties, 34
获取 CUDA 设备的数量,28
getting count of CUDA devices, 28
多个 CPU,224–225
multiple CPUs, 224–225
cudaGetDeviceProperties()
cudaGetDeviceProperties()
确定 GPU 是集成的还是离散的,223
determining if GPU is integrated or discrete, 223
多个 CUDA 流,200
multiple CUDA streams, 200
查询设备,33–35
querying devices, 33–35
零拷贝内存点积,220
zero-copy memory dot product, 220
cudaGLSetGLDevice()
cudaGLSetGLDevice()
与 OpenGL 的图形互操作,150
graphics interoperation with OpenGL, 150
准备 CUDA 使用 OpenGL 驱动程序,142
preparing CUDA to use OpenGL driver, 142
cudaGraphicsGLRegisterBuffer(), 143 , 151
cudaGraphicsGLRegisterBuffer(), 143, 151
cudaGraphicsMapFlagsNone(), 143
cudaGraphicsMapFlagsNone(), 143
cudaGraphicsMapFlagsReadOnly(), 143
cudaGraphicsMapFlagsReadOnly(), 143
cudaGraphicsMapFlagsWriteDiscard(), 143
cudaGraphicsMapFlagsWriteDiscard(), 143
cudaGraphicsUnmapResources(),144
cudaGraphicsUnmapResources(), 144
cudaHostAlloc()
cudaHostAlloc()
malloc()对比,186–187
malloc() versus, 186–187
页锁定主机内存应用,187–192
page-locked host memory application, 187–192
零拷贝内存点积,217–220
zero-copy memory dot product, 217–220
cudaHostAllocDefault()
cudaHostAllocDefault()
默认固定内存,214
default pinned memory, 214
页锁定主机内存,189–190
page-locked host memory, 189–190
cudaHostAllocMapped() 标志
cudaHostAllocMapped()flag
默认固定内存,214
default pinned memory, 214
便携式固定存储器,231
portable pinned memory, 231
零拷贝内存点积,217–218
zero-copy memory dot product, 217–218
cudaHostAllocPortable(),便携式固定存储器,230–235
cudaHostAllocPortable(), portable pinned memory, 230–235
cudaHostAllocWriteCombined() 标志
cudaHostAllocWriteCombined()flag
便携式固定存储器,231
portable pinned memory, 231
零拷贝内存点积,217–218
zero-copy memory dot product, 217–218
cudaHostGetDevicePointer()
portable pinned memory, 234
zero-copy memory dot product, 218–219
cudaMalloc(), 124
2D texture memory, 133–135
allocating device memory using, 26
CPU vector sums application, 42
CUDA streams, 194, 201–202
dot product computation, 82, 86
dot product computation, standard host memory, 215
dot product computation with atomic locks, 256
GPU hash table implementation, 269, 274–275
GPU Julia Set, 51
GPU lock function, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
measuring ray tracer performance, 110, 112
portable pinned memory, 234
ray tracing on GPU, 100
ray tracing with constant memory, 105
shared memory bitmap, 90
using multiple CPUs, 228
using texture memory, 127
cuda-memcheck, 242
cudaMemcpy()
2D texture binding, 136
copying data between host and device, 27
CPU vector sums application, 42
dot product computation, 82–83, 86
dot product computation with atomic locks, 257
GPU hash table implementation, 270, 274–275
GPU histogram computation, 174–175
GPU Julia Set, 52
GPU lock function implementation, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
heat transfer simulation animation, 122–125
measuring ray tracer performance, 111
page-locked host memory and, 187, 189
ray tracing on GPU, 101
standard host memory dot product, 216
using multiple CPUs, 228–229
cudaMemcpyAsync()
GPU work scheduling, 206–208
multiple CUDA streams, 203, 208–210
single CUDA streams, 196
timeline of intended application execution using multiple streams, 199
cudaMemcpyDeviceToHost()
CPU vector sums application, 42
dot product computation, 82, 86–87
GPU hash table implementation, 270
GPU histogram computation, 174–175
GPU Julia Set, 52
GPU sums of arbitrarily long vectors, 68
multiple CUDA streams, 204
page-locked host memory, 190
ray tracing on GPU, 101
shared memory bitmap, 91
standard host memory dot product, 216
using multiple CPUs, 229
cudaMemcpyHostToDevice()
CPU vector sums application, 42
dot product computation, 86
GPU sums of arbitrarily long vectors, 68
implementing GPU lock function, 253
measuring ray tracer performance, 111
multiple CPUs, 228
multiple CUDA streams, 203
page-locked host memory, 189
standard host memory dot product, 216
cudaMemcpyToSymbol(), constant memory, 105–106
cudaMemset()
GPU hash table implementation, 269
GPU histogram computation, 174
CUDA.NET project, 247
cudaSetDevice()
allocating portable pinned memory, 231–232, 233–234
using device properties, 34
using multiple CPUs, 227–228
cudaSetDeviceFlags()
allocating portable pinned memory, 231, 234
zero-copy memory dot product, 221
cudaStreamDestroy(), 198, 205
cudaStreamSynchronize(), 197–198, 204
cudaThreadSynchronize(), 219
cudaUnbindTexture(), 2D texture memory, 136–137
CUDPP (CUDA Data Parallel Primitives Library), 246
CUFFT library, 239
CULAtools, 246
current animation time, GPU ripple using threads, 72–74
debugging CUDA C, 241–242
detergents, CUDA applications, 10–11
dev_bitmap pointer, GPU Julia Set, 51
development environment setup
CUDA Toolkit, 16–18
CUDA-enabled graphics processor, 14–16
NVIDIA device driver, 16
standard C compiler, 18–19
summary review, 19
device drivers, 16
device overlap, GPU, 194, 198–199
__device__ function
GPU hash table implementation, 268–275
GPU Julia Set, 54
devices
getting count of CUDA, 28
GPU vector sums, 41–46
passing parameters, 25–27
querying, 27–33
use of term in this book, 23
using properties of, 33–35
devPtr, graphics interoperability, 144
dim3 variable grid, GPU Julia Set, 51–52
DIMxDIM bitmap image, GPU Julia Set, 49–51, 53
direct memory access (DMA), for page-locked memory, 186
DirectX
adding standard C to, 7
breakthrough in GPU technology, 5–6
GeForce 8800 GTX, 7
graphics interoperability, 160–161
discrete GPUs, 222–224
display accelerators, 2D, 4
DMA (direct memory access), for page-locked memory, 186
dot product computation
optimized incorrectly, 87–90
shared memory and, 76–87
standard host memory version of, 215–217
using atomics to keep entirely on GPU, 250–251, 254–258
dot product computation, multiple GPUs
allocating portable pinned memory, 230–235
using, 224–229
zero-copy, 217–222
zero-copy performance, 223
Dr. Dobb’s CUDA, 245–246
DRAMs, discrete GPUs with own dedicated, 222–223
draw_func, graphics interoperability, 144–146
end_thread(), multiple CPUs, 226
environmental science, CUDA applications for, 10–11
event timer. see timer, event
events
computing elapsed time between recorded. see cudaEventElapsedTime()
creating. see cudaEventCreate()
GPU histogram computation, 173
measuring performance with, 95
measuring ray tracer performance, 110–114
overview of, 108–110
recording. see cudaEventRecord()
stopping and starting. see cudaEventDestroy()
summary review, 114
EXIT_FAILURE(), passing parameters, 26
fAnim(), storing registered callbacks, 149
Fast Fourier Transform library, NVIDIA, 239
first program, writing, 22–24
flags, in graphics interoperability, 143
float_to_color() kernels, in graphics interoperability, 157
floating-point numbers
atomic arithmetic not supported for, 251
CUDA Architecture designed for, 7
early days of GPU computing not able to handle, 6
FORTRAN applications
CUBLAS compatibility with, 239–240
language wrapper for CUDA C, 246
forums, NVIDIA, 246
fractals. see Julia Set example
free(), C language
cudaFree() versus, 26–27
dot product computation with atomic locks, 258
GPU hash table implementation, 275
multiple CPUs, 227
standard host memory dot product, 217
GeForce 256, 5
GeForce 8800 GTX, 7
generate_frame(), GPU ripple, 70, 72–73, 154
generic classes, storing complex numbers with, 49–50
GL_PIXEL_UNPACK_BUFFER_ARB target, OpenGL interoperation, 151
glBindBuffer()
creating pixel buffer object, 143
graphics interoperability, 146
glBufferData(), pixel buffer object, 143
glDrawPixels()
graphics interoperability, 146
overview of, 154–155
glGenBuffers(), pixel buffer object, 143
global memory atomics
GPU compute capability requirements, 167
histogram kernel using, 179–181
histogram kernel using shared and, 181–183
__global__ function
add function, 43
kernel call, 23–24
running kernel() in GPU Julia Set application, 51–52
GLUT (GL Utility Toolkit)
graphics interoperability setup, 144
initialization of, 150
initializing OpenGL driver by calling, 142
glutIdleFunc(), 149
glutInit(), 150
glutMainLoop(), 144
GPU Computing SDK download, 18, 240–241
GPU ripple
with graphics interoperability, 147–154
using threads, 69–74
GPU vector sums
application, 41–46
of arbitrarily long vectors, using threads, 65–69
of longer vector, using threads, 63–65
using threads, 61–63
gpu_anim.h, 152–154
GPUAnimBitmap structure
creating, 148–152
GPU ripple performing animation, 152–154
heat transfer with graphics interoperability, 156–160
GPUs (graphics processing units)
called “devices” in this book, 23
developing code in CUDA C with CUDA-enabled, 14–16
development of CUDA for, 6–8
discrete versus integrated, 222–223
early days of, 5–6
freeing memory. see cudaFree()
hash tables, 268–275
histogram computation on, 173–179
histogram kernel using global memory atomics, 179–181
histogram kernel using shared/global memory atomics, 181–183
history of, 4–5
Julia Set example, 50–57
measuring performance with events, 108–110
ray tracing on, 98–104
work scheduling, 205–208
GPUs (graphics processing units), multiple, 213–236
overview of, 213–214
portable pinned memory, 230–235
summary review, 235–236
using, 224–229
zero-copy host memory, 214–222
zero-copy performance, 222–223
graphics accelerators, 3D graphics, 4–5
graphics interoperability, 139–161
DirectX, 160–161
generating image data with kernel, 139–142
GPU ripple with, 147–154
heat transfer with, 154–160
overview of, 139–140
passing image data to OpenGL for rendering, 142–147
summary review, 161
graphics processing units. see GPUs (graphics processing units)
grey(), GPU ripple, 74
grid
as collection of parallel blocks, 45
defined, 57
three-dimensional, 51
gridDim variable
2D texture memory, 132–133
defined, 57
dot product computation, 77–78
dot product computation with atomic locks, 255–256
GPU hash table implementation, 272
GPU Julia Set, 53
GPU ripple using threads, 72–73
GPU sums of arbitrarily long vectors, 66–67
graphics interoperability setup, 145
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
ray tracing on GPU, 102
shared memory bitmap, 91
temperature update computation, 119–120
zero-copy memory dot product, 222
half-warps, reading constant memory, 107
HANDLE_ERROR() macro
2D texture memory, 133–136
CUDA streams, 194–198, 201–204, 209–210
dot product computation, 82–83, 86–87
dot product computation with atomic locks, 256–258
GPU hash table implementation, 270
GPU histogram computation completion, 175
GPU lock function implementation, 253
GPU ripple using threads, 70
GPU sums of arbitrarily long vectors, 68
heat transfer simulation animation, 122–125
measuring ray tracer performance, 110–114
page-locked host memory application, 188–189
passing parameters, 26
paying attention to, 46
portable pinned memory, 231–235
ray tracing on GPU, 100–101
ray tracing with constant memory, 104–105
shared memory bitmap, 90–91
standard host memory dot product, 215–217
zero-copy memory dot product, 217–222
hardware
decoupling parallelization from method of executing, 66
performing atomic operations on memory, 167
hardware limitations
GPU sums of arbitrarily long vectors, 65–69
number of blocks in single launch, 46
number of threads per block in kernel launch, 63
hash function
CPU hash table implementation, 261–267
GPU hash table implementation, 268–275
overview of, 259–261
hash tables
concepts, 259–261
CPU, 261–267
GPU, 268–275
multithreaded, 267–268
performance, 276–277
summary review, 277
heat transfer simulation
2D texture memory, 131–137
animating, 121–125
computing temperature updates, 119–121
with graphics interoperability, 154–160
simple heating model, 117–118
using texture memory, 125–131
“Hello, World” example
kernel call, 23–24
passing parameters, 24–27
writing first program, 22–23
Highly Optimized Object-oriented Many-particle Dynamics (HOOMD), 10–11
histogram computation
on CPUs, 171–173
on GPUs, 173–179
overview, 170
histogram kernel
using global memory atomics, 179–181
using shared/global memory atomics, 181–183
hit() method, ray tracing on GPU, 99, 102
HOOMD (Highly Optimized Object-oriented Many-particle Dynamics), 10–11
hosts
allocating memory to. see malloc()
CPU vector sums, 39–41
CUDA C blurring device code and, 26
page-locked memory, 186–192
passing parameters, 25–27
use of term in this book, 23
zero-copy host memory, 214–222
idle_func() member, GPUAnimBitmap, 154
IEEE requirements, ALUs, 7
increment operator (x++), 168–170
initialization
CPU hash table implementation, 263, 266
CPU histogram computation, 171
GPUAnimBitmap, 149
inner products. see dot product computation
integrated GPUs, 222–224
interleaved operations, 169–170
interoperation. see graphics interoperability
Julia Set example
CPU application of, 47–50
GPU application of, 50–57
overview of, 46–47
kernel
2D texture memory, 131–133
blockIdx.x variable, 44
call to, 23–24
defined, 23
GPU histogram computation, 176–178
GPU Julia Set, 49–52
GPU ripple performing animation, 154
GPU ripple using threads, 70–72
GPU sums of a longer vector, 63–65
graphics interoperability, 139–142, 144–146
“Hello, World” example of call to, 23–24
launching with number in angle brackets that is not 1, 43–44
passing parameters to, 24–27
ray tracing on GPU, 102–104
texture memory, 127–131
key_func, graphics interoperability, 144–146
keys
CPU hash table implementation, 261–267
GPU hash table implementation, 269–275
hash table concepts, 259–260
language wrappers, 246–247
LAPACK (Linear Algebra Package), 246
light effects, ray tracing concepts, 97
Linux, standard C compiler for, 19
Lock structure, 254–258, 268–275
locks, atomic, 251–254
Macintosh OS X, standard C compiler, 19
main() routine
2D texture memory, 133–136
CPU hash table implementation, 266–267
CPU histogram computation, 171
dot product computation, 81–84
dot product computation with atomic locks, 255–256
GPU hash table implementation, 273–275
GPU histogram computation, 173
GPU ripple using threads, 69–70
GPU vector sums, 41–42
graphics interoperability, 144
page-locked host memory application, 190–192
ray tracing on GPU, 99–100
ray tracing with constant memory, 104–106
shared memory bitmap, 90
single CUDA streams, 193–194
zero-copy memory dot product, 220–222
malloc()
cudaHostAlloc() versus, 186, 190
cudaMalloc() versus, 26
ray tracing on GPU, 100
mammograms, CUDA applications for medical imaging, 9
maxThreadsPerBlock field, device properties, 63
media and communications processors (MCPs), 223
medical imaging, CUDA applications for, 8–9
memcpy(), C language, 27
memory
allocating device. see cudaMalloc()
constant. see constant memory
CUDA Architecture creating access to, 7
early days of GPU computing, 6
executing device code that uses allocated, 70
freeing. see cudaFree(); free(), C language
GPU histogram computation, 173–174
page-locked host (pinned), 186–192
querying devices, 27–33
shared. see shared memory
texture. see texture memory
use of term in this book, 23
Memory Checker, CUDA, 242
memset(), C language, 174
Microsoft Windows, Visual Studio C compiler, 18–19
Microsoft .NET, 247
multicore revolution, evolution of CPUs, 3
multiplication, in vector dot products, 76
multithreaded hash tables, 267–268
mutex, GPU lock function, 252–254
nForce media and communications processors (MCPs), 222–223
NVIDIA
compute capability of various GPUs, 164–167
creating 3D graphics for consumers, 5
creating CUDA C for GPU, 7
creating first GPU built with CUDA Architecture, 7
CUBLAS library, 239–240
CUDA-enabled graphics processors, 14–16
CUDA-GDB debugging tool, 241–242
CUFFT library, 239
device driver, 16
GPU Computing SDK download, 18, 240–241
Parallel Nsight debugging tool, 242
Performance Primitives, 241
products containing multiple GPUs, 224
Visual Profiler, 243–244
NVIDIA CUDA Programming Guide, 31
offset, 2D texture memory, 133
on-chip caching. see constant memory; texture memory
one-dimensional blocks
GPU sums of a longer vector, 63
two-dimensional blocks versus, 44
online resources. see resources, online
OpenGL
creating GPUAnimBitmap, 148–152
in early days of GPU computing, 5–6
generating image data with kernel, 139–142
interoperation, 142–147
writing 3D graphics, 4
operations, atomic, 168–170
optimization, incorrect dot product, 87–90
page-locked host memory
allocating as portable pinned memory, 230–235
overview of, 186–187
restricted use of, 187
single CUDA streams with, 195–197
parallel blocks
GPU Julia Set, 51
GPU vector sums, 45
parallel blocks, splitting into threads
GPU sums of arbitrarily long vectors, 65–69
GPU sums of longer vector, 63–65
GPU vector sums using threads, 61–63
overview of, 60
vector sums, 60–61
Parallel Nsight debugging tool, 242
parallel processing
evolution of CPUs, 2–3
past perception of, 1
parallel programming, CUDA
CPU vector sums, 39–41
example, CPU Julia Set application, 47–50
example, GPU Julia Set application, 50–57
example, overview, 46–47
GPU vector sums, 41–46
overview of, 38
summary review, 56
summing vectors, 38–41
parameter passing, 24–27, 40, 72
PC gaming, 3D graphics for, 4–5
PCI Express slots, adding multiple GPUs to, 224
performance
constant memory and, 106–107
evolution of CPUs, 2–3
hash table, 276
launching kernel for GPU histogram computation, 176–177
measuring with events, 108–114
page-locked host memory and, 187
zero-copy memory and, 222–223
pinned memory
allocating as portable, 230–235
cudaHostAllocDefault() getting default, 214
as page-locked memory. see page-locked host memory
pixel buffer objects (PBO), OpenGL, 142–143
pixel shaders, early days of GPU computing, 5–6
pixels, number of threads per block, 70–74
portable computing devices, 2
Programming Massively Parallel Processors: A Hands-on Approach (Kirk, Hwu), 244
properties
cudaDeviceProp structure. see cudaDeviceProp structure
maxThreadsPerBlock field for device, 63
reporting device, 31
using device, 33–35
PyCUDA project, 246–247
Python language wrappers for CUDA C, 246
querying, devices, 27–33
rasterization, 97
ray tracing
concepts behind, 96–98
with constant memory, 104–106
on GPU, 98–104
measuring performance, 110–114
read-modify-write operations
atomic operations as, 168–170, 251
using atomic locks, 251–254
read-only memory. see constant memory; texture memory
reductions
dot products as, 83
overview of, 250
shared memory and synchronization for, 79–81
references, texture memory, 126–127, 131–137
registration
bufferObj with cudaGraphicsGLRegisterBuffer(), 151
callback, 149
rendering, GPUs performing complex, 139
resource variable
creating GPUAnimBitmap, 148–152
graphics interoperation, 141
resources, online
CUDA code, 246–248
CUDA Toolkit, 16
CUDA University, 245
CUDPP, 246
CULAtools, 246
Dr. Dobb’s CUDA, 246
GPU Computing SDK code samples, 18
language wrappers, 246–247
NVIDIA device driver, 16
NVIDIA forums, 246
standard C compiler for Mac OS X, 19
Visual Studio C compiler, 18
resources, written
CUDA U, 245–246
forums, 246
programming massive parallel processors, 244–245
ripple, GPU
with graphics interoperability, 147–154
producing, 69–74
routine()
allocating portable pinned memory, 232–234
using multiple CPUs, 226–228
Russian nesting doll hierarchy, 164
scalable link interface (SLI), adding multiple GPUs with, 224
scale factor, CPU Julia Set, 49
scientific computations, in early days, 6
screenshots
animated heat transfer simulation, 126
GPU Julia Set example, 57
GPU ripple example, 74
graphics interoperation example, 147
ray tracing example, 103–104
rendered with proper synchronization, 93
rendered without proper synchronization, 92
shading languages, 6
shared data buffers, kernel/OpenGL rendering interoperation, 142
shared memory
bitmap, 90–93
CUDA Architecture creating access to, 7
dot product, 76–87
dot product optimized incorrectly, 87–90
and synchronization, 75
Silicon Graphics, OpenGL library, 4
simulation
animation of, 121–125
challenges of physical, 117
computing temperature updates, 119–121
simple heating model, 117–118
SLI (scalable link interface), adding multiple GPUs with, 224
spatial locality
designing texture caches for graphics with, 116
heat transfer simulation animation, 125–126
split parallel blocks. see parallel blocks, splitting into threads
standard C compiler
compiling for minimum compute capability, 167–168
development environment, 18–19
kernel call, 23–24
start event, 108–110
start_thread(), multiple CPUs, 226–227
stop event, 108–110
streams
CUDA, overview of, 192
CUDA, using multiple, 198–205, 208–210
CUDA, using single, 192–198
GPU work scheduling and, 205–208
overview of, 185–186
page-locked host memory and, 186–192
summary review, 211
supercomputers, performance gains in, 3
surfactants, environmental devastation of, 10
synchronization
of events. see cudaEventSynchronize()
of threads, 219
synchronization, and shared memory
dot product, 76–87
dot product optimized incorrectly, 87–90
overview of, 75
shared memory bitmap, 90–93
__syncthreads()
dot product computation, 78–80, 85
shared memory bitmap using, 90–93
unintended consequences of, 87–90
task parallelism, CPU versus GPU applications, 185
TechniScan Medical Systems, CUDA applications, 9
temperatures
computing temperature updates, 119–121
heat transfer simulation, 117–118
heat transfer simulation animation, 121–125
Temple University research, CUDA applications, 10–11
tex1Dfetch() compiler intrinsic, texture memory, 127–128, 131–132
tex2D() compiler intrinsic, texture memory, 132–133
texture, early days of GPU computing, 5–6
texture memory
animation of simulation, 121–125
defined, 115
overview of, 115–117
simulating heat transfer, 117–121
summary review, 137
two-dimensional, 131–137
using, 125–131
threadIdx variable
2D texture memory, 132–133
dot product computation, 76–77, 85
dot product computation with atomic locks, 255–256
GPU hash table implementation, 272
GPU Julia Set, 52
GPU ripple using threads, 72–73
GPU sums of a longer vector, 63–64
GPU sums of arbitrarily long vectors, 66–67
GPU vector sums using threads, 61
histogram kernel using global memory atomics, 179–180
histogram kernel using shared/global memory atomics, 182–183
multiple CUDA streams, 200
ray tracing on GPU, 102
setting up graphics interoperability, 145
shared memory bitmap, 91
temperature update computation, 119–121
zero-copy memory dot product, 221
threads
coding with, 38–41
constant memory and, 106–107
GPU ripple using, 69–74
GPU sums of a longer vector, 63–65
GPU sums of arbitrarily long vectors, 65–69
GPU vector sums using, 61–63
hardware limit to number of, 63
histogram kernel using global memory atomics, 179–181
incorrect dot product optimization and divergence of, 89
multiple CPUs, 225–229
overview of, 59–60
ray tracing on GPU and, 102–104
read-modify-write operations, 168–170
shared memory and. see shared memory
summary review, 94
synchronizing, 219
threadsPerBlock
allocating shared memory, 76–77
dot product computation, 79–87
three-dimensional blocks, GPU sums of a longer vector, 63
three-dimensional graphics, history of GPUs, 4–5
three-dimensional scenes, ray tracing producing 2-D image of, 97
tid variable
blockIdx.x variable assigning value of, 44
checking that it is less than N, 45–46
dot product computation, 77–78
parallelizing code on multiple CPUs, 40
time, GPU ripple using threads, 72–74
timer, event. see cudaEventElapsedTime()
Toolkit, CUDA, 16–18
two-dimensional blocks
arrangement of blocks and threads, 64
GPU Julia Set, 51
GPU ripple using threads, 70
gridDim variable as, 63
one-dimensional indexing versus, 44
two-dimensional display accelerators, development of GPUs, 4
two-dimensional texture memory
defined, 116
heat transfer simulation, 117–118
overview of, 131–137
ultrasound imaging, CUDA applications for, 9
unified shader pipeline, CUDA Architecture, 7
university, CUDA, 245
values
CPU hash table implementation, 261–267
GPU hash table implementation, 269–275
hash table concepts, 259–260
vector dot products. see dot product computation
vector sums
CPU, 39–41
GPU, 41–46
GPU sums of arbitrarily long vectors, 65–69
GPU sums of longer vector, 63–65
GPU sums using threads, 61–63
verify_table(), GPU hash table, 270
Visual Profiler, NVIDIA, 243–244
Visual Studio C compiler, 18–19
warps, reading constant memory with, 106–107
while() loop
CPU vector sums, 40
GPU lock function, 253
work scheduling, GPU, 205–208
zero-copy memory
allocating/using, 214–222
defined, 214
performance, 222–223